The Annals of Statistics 

2007, Vol. 35, No. 4, 1644-1673 

DOI: 10.1214/009053606000001613 

© Institute of Mathematical Statistics. 2007 



ASYMPTOTIC APPROXIMATION OF NONPARAMETRIC 
REGRESSION EXPERIMENTS WITH UNKNOWN VARIANCES 1 

By Andrew V. Carter 

University of California, Santa Barbara 

Asymptotic equivalence results for nonparametric regression ex- 
periments have always assumed that the variances of the observations 
are known. In practice, however, the variance of each observation is
generally considered to be an unknown nuisance parameter. We es- 
tablish an asymptotic approximation to the nonparametric regression 
experiment when the value of the variance is an additional parameter 
to be estimated or tested. This asymptotically equivalent experiment 
has two components: the first contains all the information about the 
variance and the second has all the information about the mean. The 
result can be extended to regression problems where the variance 
varies slowly from observation to observation. 

1. Introduction. We will show that a nonparametric regression exper- 
iment where the variance is unknown (and possibly changing) is asymp- 
totically equivalent to a continuous Gaussian process. This equivalence is 
demonstrated by the explicit construction of the continuous Gaussian pro- 
cess from the nonparametric regression observations and vice versa. 

In particular, a simple version of the nonparametric regression problem 
observes n independent normals, 

(1.1) $Y_i = f(i/n) + \sigma\xi_i$, $i = 1, \ldots, n$,

where $f$ is an unknown smooth function that we want to estimate (or test),
the $\xi_i$ are independent standard normals and $\sigma^2$ is the variance of the noise.
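
As an illustration only, the observations in (1.1) can be simulated with the following minimal Python sketch. The mean function `example_mean`, the value of `sigma` and the seed are hypothetical choices made for the example; they are not part of the experiment described here.

```python
import math
import random

def simulate_regression(f, sigma, n, seed=0):
    """Draw Y_i = f(i/n) + sigma * xi_i for i = 1..n with xi_i iid N(0, 1)."""
    rng = random.Random(seed)
    return [f(i / n) + sigma * rng.gauss(0.0, 1.0) for i in range(1, n + 1)]

def example_mean(t):
    # Hypothetical smooth mean function, chosen only for illustration.
    return math.sin(2.0 * math.pi * t)

ys = simulate_regression(example_mean, sigma=0.5, n=1000)
```

Setting `sigma=0.0` returns the design values $f(i/n)$ exactly, which is a quick sanity check on the indexing.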

Brown and Low [2] showed that this nonparametric regression problem is
asymptotically equivalent to trying to estimate $f$ in the white-noise
experiment that observes the continuous process

(1.2) $Y(t) = \int_0^t f(x)\,dx + \frac{\sigma}{\sqrt{n}} W(t)$, $0 \le t \le 1$,

where $W(t)$ is a standard Brownian motion (SBM). Brown and Low [2]
assumed that the variance structure was known, and their construction of
the white-noise process $Y(t)$ depends crucially on the value of $\sigma$. In practice,
however, we typically do not know the value of $\sigma$, and it is usually considered
a secondary "nuisance" parameter. Its estimation is only necessary to the
extent that it calibrates the estimation of $f$ (as in setting a threshold level or
bandwidth). Our approach is to include the variance as a second parameter
so that the experiment is now concerned with decision procedures concerning
the pair $(f, \sigma)$. While it is too strong to assume that $\sigma$ is known, we may
be erring in the other direction by promoting its importance to the same
level as the mean function. However, Theorem 1 shows that there is no
significant penalty to pay in treating the variance as part of the parameter
space because the equivalence holds for essentially the same spaces as in [2].

Our motivation comes from wavelet thresholding techniques (e.g., [5, 7]) 
that estimate the variance using the high frequency wavelet coefficients and 
estimate the mean mainly from the low frequency coefficients using the es- 
timate of the variance to determine which terms to include in the model. 
A similar approach is used by Rice [21] to choose the bandwidths of kernel
estimators. Our asymptotic approximation contains two components: a
$\chi^2$-distributed random variable with information about the variance, and a
continuous Gaussian process with information about the mean.

This sort of approximation is also available when the variance is a function 
over the unit interval, $Y_i = f(i/n) + \sigma(i/n)\xi_i$. In this case, the strategy is to
separate the $Y_i$'s into groups such that within each group the variance function
is nearly constant and then to proceed as in the constant variance case.
Not surprisingly, if the variance is also to be nonparametrically estimated, 
the equivalence result is only true under somewhat stricter conditions on 
the means. 

1.1. Asymptotic equivalence. The proposed approximation is in the sense 
of Le Cam's deficiency distance between statistical experiments [15]. This 
type of approximation provides a correspondence between estimation pro- 
cedures in each experiment such that any good estimator in the asymptotic 
approximation corresponds to a good estimator in the nonparametric regression
experiment and vice versa.

In this formulation, we have a pair of statistical experiments $\mathcal{P}$ and $\mathcal{Q}$ that
consist of sets of distribution functions $\{P_f \mid f \in \mathcal{F}\}$ on $(\mathcal{X}, \mathcal{A})$ and
$\{Q_f \mid f \in \mathcal{F}\}$ on $(\mathcal{Y}, \mathcal{B})$. Both are indexed by the same parameter set $\mathcal{F}$,
and the two experiments are equivalent if they provide the same information
about $f \in \mathcal{F}$. Le Cam proposed a pseudometric for statistical experiments,
$\Delta(\mathcal{P}, \mathcal{Q}) = \max[\delta(\mathcal{P}, \mathcal{Q}), \delta(\mathcal{Q}, \mathcal{P})]$ where
$\delta(\mathcal{P}, \mathcal{Q}) = \inf_K \sup_f \|K(P_f) - Q_f\|_{TV}$, using the
total variation distance and "transitions" $K$ that map distributions on the
sample space of $\mathcal{P}$ to distributions on the sample space of $\mathcal{Q}$. For our
purposes, however, Le Cam's general notion of transitions (see [16], page 18) is
not necessary. Instead, we will bound $\delta(\mathcal{P}, \mathcal{Q})$ by proposing a randomized
transformation of the observed data. Thus, $K$ can be represented by the
conditional distribution on $(\mathcal{Y}, \mathcal{B})$ given an observation from $P_f$, and $K(P_f)$
is the marginal distribution on $\mathcal{Y}$.

Therefore, the first step in bounding this $\Delta$-distance is to propose a
candidate transformation from $\mathcal{X}$ to $\mathcal{Y}$. Then the bound is established by
bounding the distance between the distributions of the transformed
observations and the observations from the approximating experiment. Explicit
transformations between the experiments are useful because they generate
a correspondence between the estimators in the two experiments. For instance, if
the distribution of $T(X)$ is close to that of $Y$, then the estimator $\hat f(T(X))$
has nearly the same risk as $\hat f(Y)$. The transformation $T$ may be randomized
in the sense that it may depend on some external random variables,
but it may not depend on the parameters. In particular, the transformation
in [2] depends on the variance $\sigma^2$, which is now a part of our parameter
space. Therefore, we must formulate a different transformation that does
not depend on $\sigma^2$.

Two sequences of experiments $\mathcal{Q}_n$ and $\mathcal{P}_n$ are asymptotically equivalent
if $\Delta(\mathcal{P}_n, \mathcal{Q}_n) \to 0$. Asymptotic equivalence implies that the risk under a
bounded loss function achieved by any estimator in $\mathcal{P}_n$ can be achieved
asymptotically by associated estimators in $\mathcal{Q}_n$ and vice versa [15].

1.2. Main results. 

1.2.1. Constant variance. In order to accommodate the added aspect of
an unknown variance, the parameter space will be expanded to include both
the smooth functions $f$ and the variance $\sigma^2$. The specification of the exact
set of parameters is described in Section 2. For Theorem 1, the parameter
space $\mathcal{F}_\alpha$ includes all functions in a Hölder space with $\alpha > 1/2$ as in [2].

Theorem 1. Suppose that the experiment $\mathcal{P}_n$ observes $Y_i$ as in (1.1).
The distributions are indexed by $(f, \sigma) \in \mathcal{F}_\alpha \times \mathbb{R}^+$ as in Definition 1.

Further, suppose that the experiment $\mathcal{Q}_n$ has distributions indexed by the
same pairs $(f, \sigma)$ with $V \sim \Gamma(n/2, 2\sigma^2/n)$ and

(1.3) $Y(t) \mid V = \int_0^t f(x)\,dx + V^{1/2} n^{-1/2} W(t)$, $0 \le t \le 1$,

where $W(t)$ is a SBM.

Then the experiments $\mathcal{P}_n$ and $\mathcal{Q}_n$ are asymptotically equivalent,
$\Delta(\mathcal{P}_n, \mathcal{Q}_n) \to 0$.

The proof for simplified versions of these experiments is in Section 3, and
the rest of the argument is in Section 5.

Remarks on Theorem 1. Note that the experiment $\mathcal{Q}_n$ is equivalent to
observing $Y(t)$ alone because the random variable $V$ can be computed almost
surely from $Y(t)$ via its quadratic variation. This is why the variance of $Y(t)$
is random as opposed to just $\sigma^2/n$ as in [2]. $\mathcal{Q}_n$ is like Le Cam's locally
asymptotically mixed normal experiment ([17], page 121).
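
The quadratic variation remark can be checked numerically. The sketch below discretizes a process of the form (1.3) on a fine grid and recovers $V$ as $n$ times the sum of squared increments. The grid size, the mean function passed as `f` and the seed are assumptions made for illustration, and on a finite grid the recovery is only approximate.

```python
import math
import random

def recover_v(f, v, n, grid=20000, seed=0):
    """Approximate V from a discretized path of Y(t) = int f + sqrt(V/n) W(t)."""
    rng = random.Random(seed)
    dt = 1.0 / grid
    noise_sd = math.sqrt(v / n) * math.sqrt(dt)   # sd of one Brownian increment, scaled
    quad_var = 0.0
    for j in range(grid):
        drift = f((j + 0.5) * dt) * dt            # midpoint rule for the integral of f
        increment = drift + noise_sd * rng.gauss(0.0, 1.0)
        quad_var += increment * increment
    return n * quad_var                           # quadratic variation of Y is V / n

estimate = recover_v(lambda t: math.sin(2.0 * math.pi * t), v=1.3, n=100)
```

The drift contributes a term of order `1 / grid` to the sum of squares, so the estimate approaches the true `v` as the grid is refined.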

One implication of this approximation is that asymptotically $V$ is sufficient
for estimating $\sigma^2$. Further, because the distributions of $Y(t)$ conditional
on different values of $V$ are mutually singular, the conditionality principle
implies that inference for $f$ should be performed conditional on $V$. For
instance, minimax results for the white-noise problem similar to those in [20]
or [6] bound $\inf_{\hat f}\sup_{f \in \mathcal{F}_\alpha} E\,L(f, \hat f) = R(\mathcal{F}_\alpha, \sigma^2)$ for observations as in (1.2).
This implies a bound on the risk in estimating $f$ in (1.3) using the expected
value of the conditional risk, $E\,R(\mathcal{F}_\alpha, V)$. Theorem 1 implies that the same
asymptotic minimax result also applies to $\mathcal{P}_n$ (for bounded loss functions).

The advantage of approximating the regression observations in (1.1) by
the continuous process is that it makes certain calculations easier. For
instance, in the experiment $\mathcal{Q}_n$ a linear estimate of the mean $f$ at a point
$t$ which is of the form $Y(K_h)$, where $K_h$ is a kernel function with bandwidth
$h$, has a normal distribution with mean $\int K_h(x) f(x)\,dx$ and variance
$\frac{V}{n}\int K_h^2(x)\,dx$ conditional on $V$. The bandwidth $h$ should be chosen as a
function of $V$ to minimize the error, and even knowing the exact value of $\sigma$
would not provide a better bandwidth. This is very much like the approach
to choosing a bandwidth described in [21].

1.2.2. Varying variance. It may be more interesting to consider nonparametric
regression experiments where the variance changes over the interval.
Here the parameter space is the product of two function spaces, $\mathcal{F}_\alpha$ and
$\Sigma$. The varying variance means that more smoothness is required in $\mathcal{F}_\alpha$; in
particular, $\mathcal{F}_\alpha$ includes Hölder spaces only for $\alpha > 3/4$, and $\Sigma$ includes
functions $\sigma(t)$ such that $\log\sigma(t)$ is Hölder for $\alpha_1 > 1$. The exact definition of the
parameter space is in Section 2.

Theorem 2. The experiment $\mathcal{P}_n$ observes $Y_i = f(i/n) + \sigma(i/n)\xi_i$ for
$i = 1, \ldots, n$ with $\xi_i$ as in Theorem 1 and $(f, \sigma^2) \in \mathcal{F}_\alpha(\gamma_k) \times \Sigma(\alpha_1)$, the
parameter space in Definition 2.

The experiment Q n ,m observes two Gaussian processes: 




and then conditional on $V(t)$

$dY(t) = f(t)\,dt + Z_\ell^{1/2} n^{-1/2}\,dW_1(t)$, $(\ell - 1)/m \le t < \ell/m$,

for $\ell = 1, \ldots, m$, where $Z_\ell = \exp[m(V(\ell/m) - V([\ell - 1]/m))]$. The processes
$W_1(t)$ and $W_2(t)$ are independent SBMs.

For any $\alpha > 3/4$ and $\alpha_1 > \max(1, \alpha/(2\alpha - 1))$, there is a sequence $m_n$ (where
$n^{1/3} \le m_n \le n^{1/2}$ depending on $\alpha$ and $\alpha_1$) such that these experiments are
asymptotically equivalent, $\Delta(\mathcal{P}_n, \mathcal{Q}_{n,m_n}) \to 0$.

The proof of a simplified version of this result is in Section 4 and the rest 
of the proof is in Section 6 and depends on asymptotic results in Sections 7 
and 8 that bound the distance between the simplified experiments and $\mathcal{P}$
and $\mathcal{Q}$, respectively.

Theorem 2 requires a bit more smoothness on $f$, but we can trade off a bit
of smoothness in the set of mean functions for less smoothness in the variance
space. In particular, if $f$ is a Hölder function for some $3/4 < \alpha < 1$, then
$\log\sigma^2(t)$ needs to be a Hölder function for $\alpha_1 > \alpha/(2\alpha - 1)$. This always
implies that $\alpha_1 > 1$ and the variance function has one bounded derivative.

Sections 9 and 10 give bounds on the K-L divergence between gamma 
and normal distributions that will be used in the proofs. Further technical 
lemmas are established in Sections 11, 12 and 13. 

1.3. Related work. The equivalence results of Brown and Low [2] were 
the first to apply Le Cam's deficiency to a nonparametric regression exper- 
iment, and their introduction provides a number of further references moti- 
vating this approach. Brown, Cai, Low and Zhang [1] extends their results 
to the case where the design points are randomly chosen uniformly over the 
interval. Rohde [22] uses a Fourier series decomposition to get better results 
in the approximation of Brown and Low [2]. Carter [3] extends the fixed 
design result to the unit square. All of these results assume that the errors 
are normal with a known variance. Brown and Low [2] discuss how to adjust 
their results to cases where the variance changes over the interval, but it is 
essential to their methodology that the experimenter know the variance at 
each observation. 

Grama and Nussbaum [10] discusses nonparametric regression problems 
with nonnormal errors. In particular, one of the cases they treat is estimat- 
ing the variance of normal observations. Zhou [24] treats the variance case in 
particular and improves the bounds to apply to Besov spaces. The Q experi- 
ment in Theorem 2 reduces to the continuous Gaussian experiment from [10] 
if the mean function is assumed known. Therefore, Theorem 2 synthesizes 
the results of both [2] and [10] (under stronger smoothness conditions). 

The most interesting applications of our work may be in heteroscedastic
nonparametric regression, on which there is a considerable literature (e.g.,
[4, 9, 11, 13, 23]). The variance estimator of Müller and Stadtmüller [19]
seems closest to these results in that the mean squared error is estimated
on a fine grid, thus producing approximately the $V(t)$ or $Z_\ell$ of experiment
$\mathcal{Q}_{n,m}$, and then these observations are smoothed to produce an estimator of
$\sigma^2(t)$. Hall, Kay and Titterington [11] improve on this technique by finding
the best linear functions of the $Y_i$ which can be squared and averaged to get an
estimate of $\sigma^2$. Other types of estimators from Fan and Yao [9] and Ruppert,
Wand, Holst and Hössjer [23] are based on residuals from a preliminary fit
of the mean. Testing for heteroscedasticity is addressed in Dette and Munk
[4] based on differences between successive observations, while Eubank and
Thomas [8] use residuals.

2. The parameter spaces. The most convenient way of describing these
parameter spaces is using wavelet bases and their associated Besov sequence
norms. Assuming that the mean functions $f$ are in $L^2([0, 1])$, let
$\phi_{k,j}$ for $j = 1, \ldots, 2^k$ and $\psi_{i,j}$ for $i \ge k$ and $j = 1, \ldots, 2^i$ be the scale
functions and wavelets, respectively, for an orthonormal wavelet basis on [0,1].
For most of our arguments it is necessary that these are the Haar basis:
$\phi_{k,j}(t) = 2^{k/2}\phi(2^k t - j + 1)$ where $\phi(t) = 1\{0 < t < 1\}$, and $\psi_{i,j}$ is defined
analogously with $\psi(t) = 1\{0 < t < 1/2\} - 1\{1/2 < t < 1\}$.
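
The Haar functions just defined are easy to evaluate directly. The following Python sketch builds them and checks orthonormality by summing over dyadic cells. The half-open interval convention at the endpoints and the helper names `scaled` and `inner` are assumptions made for the example; the endpoint convention does not affect any integral.

```python
def phi(t):
    """Haar father function, taken here as the indicator of [0, 1)."""
    return 1.0 if 0.0 <= t < 1.0 else 0.0

def psi(t):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1)."""
    return phi(2.0 * t) - phi(2.0 * t - 1.0)

def scaled(mother, level, j):
    """Return t -> 2**(level/2) * mother(2**level * t - j + 1), j = 1..2**level."""
    def g(t):
        return 2.0 ** (level / 2.0) * mother(2.0 ** level * t - j + 1)
    return g

def inner(g, h, depth=10):
    """Inner product on [0, 1]; exact for functions constant on cells of width 2**-depth."""
    cells = 2 ** depth
    total = sum(g((c + 0.5) / cells) * h((c + 0.5) / cells) for c in range(cells))
    return total / cells

norm_sq = inner(scaled(psi, 3, 2), scaled(psi, 3, 2))
cross = inner(scaled(psi, 2, 1), scaled(psi, 3, 1))
```

Because each function is piecewise constant on dyadic intervals, evaluating at cell midpoints integrates exactly (up to floating point) whenever `depth` exceeds the levels involved.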

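The coefficients $\vartheta_{k,j}$ and $\theta_{i,j}$ are such that

(2.1) $f(t) = \sum_{j=1}^{2^k} \vartheta_{k,j}\phi_{k,j}(t) + \sum_{i=k}^{\infty}\sum_{j=1}^{2^i} \theta_{i,j}\psi_{i,j}(t)$.

These coefficients can be found via $\theta_{i,j} = \langle f, \psi_{i,j}\rangle = \int f\psi_{i,j}\,dx$. The parameter
space is the set of smooth functions that can be described succinctly by
the basis functions $\phi_{k,j}$ and $\psi_{i,j}$.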

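There are two Besov sequence norms we will use on the series of coefficients.
The $b(\alpha, 2, 2)$ norm is

$\|\theta\|_{b(\alpha,2,2)} = \Big(2^{2k\alpha}\sum_j \vartheta_{k,j}^2 + \sum_{i \ge k} 2^{2i\alpha}\sum_j \theta_{i,j}^2\Big)^{1/2}$

and the $b(\alpha, \infty, 1)$ norm is

$\|\theta\|_{b(\alpha,\infty,1)} = \sup_j 2^{k(\alpha+1/2)}|\vartheta_{k,j}| + \sum_{i \ge k} 2^{i(\alpha+1/2)}\sup_j|\theta_{i,j}|$.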

These norms are equivalent to Besov norms (see, e.g., [12], Chapter 9). 

The parameter spaces we will use are compact in these norms. This implies
that there is a uniform bound on the partial sums in each norm. Specifically,
for a sequence of positive $\gamma_k$ that goes to 0 as $k$ gets large, let $\Theta(\alpha, 2, 2, \gamma_k)$
be the set of all sequences $\theta_{i,j}$ such that

(2.2) $\Big(\sum_{i \ge k} 2^{2i\alpha}\sum_j \theta_{i,j}^2\Big)^{1/2} \le \gamma_k$.

Analogously,

$\sup_{\theta \in \Theta(\alpha,\infty,1,\gamma_k)} \Big[\sum_{i \ge k} 2^{i(\alpha+1/2)}\sup_j|\theta_{i,j}|\Big] \le \gamma_k$.

The results in Sections 3 and 4 require that the mean functions are in
$\Theta(\alpha, 2, 2, \gamma_k)$ while the approximations in Sections 5 and 6 require that the
means are in $\Theta(\alpha, \infty, 1, \gamma_k)$. Hölder$(M, \alpha^*)$ functions with $\alpha < \alpha^* < 1$ are in
both spaces with $\gamma_k = M2^{k(\alpha - \alpha^*)}$.

Definition 1. Using the Haar basis functions, the parameter space for
Theorem 1 is

$\mathcal{F}_\alpha(\gamma_k) \times \mathbb{R}^+ = \{(f, \sigma^2) : f \in \Theta(1/2, 2, 2, \gamma_k\sigma) \cap \Theta(1/2, \infty, 1, \gamma_k\sigma),\ \sigma^2 > 0\}$.

When the constant variance is replaced by a function $\sigma^2(t)$, then greater
smoothness in the mean function is necessary. The log of the variance function
is assumed to be a Hölder function with $\alpha_1 > 1$. Let $r(t) = \log\sigma^2(t)$.
Then the parameter set $H(M, \alpha_1)$ includes all such $r$ where

(2.3) $\sup_{t \in [0,1]} |r'(t)| \le M$ and $\sup_{t,s \in [0,1]} |r'(t) - r'(s)| \le M|s - t|^{\alpha_1 - 1}$.

Furthermore, let $\log\bar\sigma^2 = \int_0^1 \log\sigma^2(t)\,dt$.

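Definition 2. The parameter space for Theorem 2 is

$\mathcal{F}_\alpha(\gamma_k) \times \Sigma(\alpha_1) = \{(f, \sigma^2) : f(t) \in \Theta(\alpha, 2, 2, \gamma_k\bar\sigma) \cap \Theta(\alpha, \infty, 1, \gamma_k\bar\sigma)$,
and $\log(\sigma^2(t)) \in H(M, \alpha_1)\}$,

where again the basis is assumed to be the Haar basis.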

Remark. There are two tricks used here to avoid the condition $\sigma^2(t) > \varepsilon > 0$.
First, the smoothness of the functions is measured relative to the variance
in that the tail of the Besov norm has to decrease proportionally with $\sigma$.
Also, the smoothness condition on the variance is on the logarithm of $\sigma^2(t)$
as opposed to $\sigma^2(t)$ itself, thus $\sup_{0 \le t \le 1} |\log(\sigma^2(t)/\bar\sigma^2)| \le M$ by the mean value
theorem. We avoided a lower bound on the variance so that the experiments
will still be invariant under rescalings.

3. Sequence space result. Instead of working directly with $\mathcal{P}_n$, we will first
consider an experiment based on the orthonormal basis functions. The
experiment $\mathcal{P}_k$ observes $n = 2^k$ independent normals

$X_{0,j} \sim \mathcal{N}(\vartheta_{k_0,j}, \sigma^2/n)$ for $j = 1, \ldots, 2^{k_0}$,
$X_{i,j} \sim \mathcal{N}(\theta_{i,j}, \sigma^2/n)$ for $k_0 \le i < k$,

where the $\vartheta_{k_0,j}$ and $\theta_{i,j}$ are the wavelet coefficients of the mean function $f$
as defined above.

The experiment $\mathcal{P}_\infty$ observes the entire sequence of normal random
variables. This sequence experiment is equivalent to the experiment that
observes the Gaussian process from (1.2) because the random coefficients
can be generated from this process via $X_{0,j} = \int \phi_{k_0,j}(t)\,dY(t)$ and
$X_{i,j} = \int \psi_{i,j}(t)\,dY(t)$, and the process can be constructed from the coefficients via
(2.1) using $\theta_{i,j} = X_{i,j}$. Unfortunately, this experiment $\mathcal{P}_\infty$ is completely
informative with respect to estimating $\sigma^2$ and therefore cannot be asymptotically
equivalent to $\mathcal{P}_n$ except in trivial cases.

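Instead, $\mathcal{P}_k$ is approximated by $\mathcal{Q}$, which replaces $\sigma^2$ in $\mathcal{P}_\infty$ with a
chi-squared observation. The experiment $\mathcal{Q}$ observes the random variables

$V \sim \Gamma(n/2, 2\sigma^2/n)$,
$Y_{0,j} \mid V \sim \mathcal{N}(\vartheta_{k_0,j}, V/n)$ for $j = 1, \ldots, 2^{k_0}$,
$Y_{i,j} \mid V \sim \mathcal{N}(\theta_{i,j}, V/n)$ for $i \ge k_0$.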

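Lemma 1. For the parameter set
$\mathcal{F}_\alpha(\gamma_k) \times \mathbb{R}^+ = \{(\theta, \sigma^2) : \theta \in \Theta(1/2, 2, 2, \gamma_k\sigma),\ \sigma^2 \in \mathbb{R}^+\}$
and $n = 2^k$, $m = 2^{k_0}$ and $m = n\gamma_{k_0}$,

(3.1) $\Delta(\mathcal{P}_k, \mathcal{Q}) \le 2\gamma_{k_0}^{1/2}$.

If $\gamma_k = M2^{-\varepsilon k}$ for some small $\varepsilon$ between 0 and 1/2, then

(3.2) $\Delta(\mathcal{P}_k, \mathcal{Q}) \le 2M^{1/2} n^{-\varepsilon/(2(1+\varepsilon))}$.

Clearly, (3.2) follows directly from (3.1) using $m = M^{1/(1+\varepsilon)} n^{1/(1+\varepsilon)}$.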
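
The arithmetic behind the step from (3.1) to (3.2) can be sketched in a few lines of Python. The function name and arguments are hypothetical. The sketch assumes $\gamma_{k_0} = M m^{-\varepsilon}$ with $m = 2^{k_0}$ and ignores the requirement that $m$ be a power of two.

```python
import math

def balance_dimension(n, big_m, eps):
    """Solve m = n * gamma(m) with gamma(m) = big_m * m**(-eps); dyadic rounding ignored."""
    m = (big_m * n) ** (1.0 / (1.0 + eps))
    gamma = big_m * m ** (-eps)
    bound = 2.0 * math.sqrt(gamma)   # right-hand side of (3.1)
    return m, gamma, bound

m, gamma, bound = balance_dimension(n=2 ** 20, big_m=1.0, eps=0.25)
```

With $M = 1$ the returned bound coincides with $2n^{-\varepsilon/(2(1+\varepsilon))}$, the rate appearing in (3.2).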

This lemma and its proof imply that there is a sense in which the $\chi^2$ random
variable $V$ and the scaling function coefficients $Y_{0,j}$ are asymptotically
sufficient statistics for these experiments. Sections 3.5 and 3.7 argue that the
information about $f$ in the $Y_{i,j}$ and $X_{i,j}$ is negligible and the information
about the variances is summarized in $V$.

The bound on the deficiency $\delta(\mathcal{P}_k, \mathcal{Q})$ is described in Sections 3.2-3.6 and
the bound on $\delta(\mathcal{Q}, \mathcal{P}_k)$ is in Section 3.7.

3.1. Using Kullback-Leibler divergence. The total variation distance mea- 
sures the distance between distributions in the deficiency distance, but total 
variation is inconvenient for the analysis especially in the case of product 
measures. It is easier to establish the bounds using Kullback-Leibler diver- 
gence because the divergence between the joint distributions of X and Y is 
equal to the divergence between the marginal distributions of X plus the 
expected value of the divergence between the conditional distributions of Y 
given X. In particular, for product measures $D(\prod_i P_i, \prod_i Q_i) = \sum_i D(P_i, Q_i)$.
We also have the bound $\|P - Q\|_{TV} \le \sqrt{D(P, Q)}$, which allows us to apply
K-L bounds to the deficiency distance.

In a convenient abuse of notation, we will use $D(X, Y)$ to refer to the
divergence between the distributions of $X$ and $Y$. This allows us to write

$D((X_1, Y_1), (X_2, Y_2)) = D(X_1, X_2) + E[D(Y_1, Y_2 \mid X)]$.

Furthermore, $D(X, Y) \le E[D(X, Y \mid V)]$, which means that the divergence
between the marginal distributions of $X$ and $Y$ is less than the expected
value of the divergence between the conditional distribution of $X$ given $V$
and the conditional distribution of $Y$ given $V$.
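
For normal distributions both quantities in this section have closed forms, which makes the bound easy to check numerically. The sketch below is illustrative only: the function names are hypothetical, and the total variation distance is taken as $\sup_A |P(A) - Q(A)|$, which is one of the two common normalizations and an assumption here.

```python
import math

def kl_normal(mu1, var1, mu2, var2):
    """K-L divergence D(N(mu1, var1) || N(mu2, var2))."""
    ratio = var1 / var2
    return 0.5 * (ratio - 1.0 - math.log(ratio)) + (mu1 - mu2) ** 2 / (2.0 * var2)

def tv_equal_variance(mu1, mu2, var):
    """sup_A |P(A) - Q(A)| for N(mu1, var) and N(mu2, var)."""
    z = abs(mu1 - mu2) / (2.0 * math.sqrt(var))
    return math.erf(z / math.sqrt(2.0))   # equals 2 * Phi(z) - 1

def kl_product(pairs):
    """Divergence between product measures is the sum over coordinates."""
    return sum(kl_normal(*p) for p in pairs)

shifts = [0.1, 0.5, 1.0, 2.0, 4.0]
checks = [tv_equal_variance(0.0, d, 1.0) <= math.sqrt(kl_normal(0.0, 1.0, d, 1.0))
          for d in shifts]
```

When only the means differ, `kl_normal` reduces to the squared mean difference over twice the variance; when only the variances differ, it reduces to the expression of the form $\frac{1}{2}(x - 1 - \log x)$.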

3.2. Hierarchical structure of the experiments. The strategy is to consider
the $X_{0,j}$ and $Y_{0,j}$ observations as those containing the information
about the mean, and the rest as having information about the variance. The
$\mathcal{Q}$ experiment consists of observations of $V$, $Y_{0,j}$ and $Y_{i,j}$ for $i \ge k_0$ as above.

We can construct a parallel structure from the $\mathcal{P}_k$ observations,

$\hat V = \frac{n}{n - m}\sum_{k_0 \le i < k}\sum_j X_{i,j}^2$,

$\hat Y_{0,j} = X_{0,j} \mid \hat V \sim \mathcal{N}(\vartheta_{k_0,j}, \sigma^2/n)$, $\hat Y_{i,j} \mid \hat V \sim \mathcal{N}(0, \hat V/n)$,

where the $\hat Y_{i,j}$ are generated independently conditional on the estimated
variance.

In both experiments, the conditional distribution given $V$ of the wavelet
coefficients $Y_{i,j}$ is independent of the distribution of the scaling function
coefficients $Y_{0,j}$. Thus, there is a decomposition of the bound on the divergence
into three terms,

(3.3) $D(\mathcal{Q}, K(\mathcal{P})) = D(V, \hat V) + \sum_j E[D(Y_{0,j}, X_{0,j} \mid V)] + \sum_{i \ge k_0}\sum_j E[D(Y_{i,j}, \hat Y_{i,j} \mid V)]$.

We will bound the contribution from each of these terms in Sections 3.3, 3.4
and 3.5, respectively.

3.3. The variances. The construction first generates an estimate of the
variance. This estimate, $\hat V$, is $\sigma^2/(n - m)$ times a noncentral $\chi^2$ random
variable with $n - m$ degrees of freedom and noncentrality parameter

$\mu = n\sum_{k_0 \le i < k}\sum_j \theta_{i,j}^2/\sigma^2$.

The distributions of $\hat V$ and $V$ are approximately gamma with $\alpha = (n - m)/2$
and $n/2$, respectively. The distribution of $\hat V$ is approximate because
the normals have nonzero means. Ignoring that inconvenience for a moment,
Lemma 6 bounds the divergence $D(V, \hat V)$ by a term of order $m^2 n^{-2}$.

The effect of the noncentrality of $\hat V$ on the bound can be handled using a
mixture distribution characterization of the noncentral $\chi^2$. A noncentral $\chi_n^2$
with noncentrality parameter $\mu$ can be generated from a Poisson mixture
of $\chi^2$ distributions with $2\Lambda + n$ degrees of freedom with mixing parameter
$\Lambda \sim \mathrm{Poisson}(\mu)$. For the K-L divergence between mixture distributions,
$D(V, \hat V) \le E_\mu D(V, \hat V \mid \Lambda)$ where $V$ is independent of $\Lambda$. From (9.6),

$E_\mu D(V, \hat V \mid \Lambda) \le \frac{m^2}{2n^2} + E_\mu\Big(\frac{\Lambda^2}{2(n - m)} + \frac{\Lambda}{n - 1}\Big) \le \frac{m^2}{2n^2} + \frac{\mu^2 + 3\mu}{2(n - m)}$.

The size of $\mu$ will be shown below in (3.7) to be $n\gamma_{k_0}^2/m$, and therefore only
the first term in this bound will concern us,

(3.4) $D(V, \hat V) \le \frac{m^2}{2n^2}$ + smaller order terms.
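
The Poisson mixture representation can be illustrated by simulation. The sketch below is not the construction used in the proof; it only compares sample means from the mixture and from a direct sum of squared shifted normals. It assumes the common convention in which the noncentrality $\lambda$ is the sum of squared means of unit-variance normals and the mixing variable is Poisson with mean $\lambda/2$; the parameterization in the text may be normalized differently. All function names are hypothetical.

```python
import math
import random

def poisson_draw(rng, mean):
    """Knuth's multiplication method; adequate for small means."""
    limit = math.exp(-mean)
    count = 0
    prod = rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def noncentral_chi2_mixture(rng, df, lam):
    """Central chi-square with df + 2J degrees of freedom, J ~ Poisson(lam / 2)."""
    j = poisson_draw(rng, lam / 2.0)
    return rng.gammavariate((df + 2 * j) / 2.0, 2.0)

def noncentral_chi2_direct(rng, df, lam):
    """Sum of squares of df unit-variance normals whose squared means add up to lam."""
    shift = math.sqrt(lam / df)
    return sum((rng.gauss(0.0, 1.0) + shift) ** 2 for _ in range(df))

def sample_mean(draw, df, lam, reps=20000, seed=0):
    rng = random.Random(seed)
    return sum(draw(rng, df, lam) for _ in range(reps)) / reps

mixture_mean = sample_mean(noncentral_chi2_mixture, 5, 4.0)
direct_mean = sample_mean(noncentral_chi2_direct, 5, 4.0)
```

Under this convention both sample means should be close to `df + lam`.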

3.4. The top level. The broadest coefficients $X_{0,j}$ are equated directly
to the $Y_{0,j}$, and the distributions are normals with means $\vartheta_{k_0,j}$. The only
difference between the two sets of coefficients is the variance: for the $Y_{0,j}$ it
is $V/n$, and for the $X_{0,j}$ it is $\sigma^2/n$.

Therefore, from (10.1),

(3.5) $\sum_j E[D(Y_{0,j}, X_{0,j} \mid V)] = \frac{m}{2}E\Big[\frac{V}{\sigma^2} - 1 - \log\Big(\frac{V}{\sigma^2}\Big)\Big] \le \frac{m}{2}\log\Big(\frac{n}{n - 2}\Big) \le \frac{m}{n - 2}$,

using Jensen's inequality on $E\log 1/V$. This is actually larger than the error
in (3.4), and both bounds require that $m = o(n)$.
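
The chain of inequalities in the top-level bound can be checked numerically. The Python sketch below estimates the expectation by Monte Carlo, assuming $V/\sigma^2$ follows a gamma distribution with shape $n/2$ and scale $2/n$ (so that it has mean 1), and compares it with the two closed-form bounds. The sample size, seed and function names are choices made for the example.

```python
import math
import random

def top_level_divergence(n, m, reps=50000, seed=0):
    """Monte Carlo estimate of (m/2) E[V/s2 - 1 - log(V/s2)], V/s2 ~ Gamma(n/2, 2/n)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        ratio = rng.gammavariate(n / 2.0, 2.0 / n)   # V / sigma^2, which has mean 1
        total += ratio - 1.0 - math.log(ratio)
    return 0.5 * m * total / reps

def jensen_bound(n, m):
    """Bound obtained from Jensen's inequality applied to E log(1/V)."""
    return 0.5 * m * math.log(n / (n - 2.0))

def final_bound(n, m):
    """Simplified bound using log(1 + x) <= x."""
    return m / (n - 2.0)

estimate = top_level_divergence(n=50, m=5)
```

For moderate `n` the Monte Carlo value sits noticeably below both bounds, which is consistent with the bounds being convenient rather than tight.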

3.5. The bottom levels. The third and final term in (3.3) compares the
wavelet coefficients. The coefficients $\hat Y_{i,j}$ are uninformatively generated by
random zero-mean normals with variance $\hat V/n$.

The difference between one of these normals and a normal generated
by the experiment $\mathcal{Q}$ (conditional on $V$) is in the difference of the means,
$D(Y_{i,j}, \hat Y_{i,j} \mid V) = n\theta_{i,j}^2 V^{-1}/2$. Thus the total error is

(3.6) $\sum_{i \ge k_0}\sum_j E[D(Y_{i,j}, \hat Y_{i,j} \mid V)] = \frac{n}{2}E[V^{-1}]\sum_{i \ge k_0}\sum_j \theta_{i,j}^2$.

Using $m = 2^{k_0}$,

(3.7) $n\sum_{i \ge k_0}\sum_j \frac{\theta_{i,j}^2}{\sigma^2} \le \frac{n}{m}\sum_{i \ge k_0} 2^i\sum_j \frac{\theta_{i,j}^2}{\sigma^2} \le \frac{n}{m}\gamma_{k_0}^2$.

3.6. Choosing m. Choosing the dimension of the scaling functions balances
the errors in the approximation of the scaling and the wavelet coefficients.
The trade-off is between the bound in (3.5) ($m/n$) and the bound in
(3.7). Minimizing the bound is then possible by setting the two terms equal
to each other,

$m = n\gamma_{k_0} \Rightarrow \frac{m^2}{2n^2} + \frac{m}{n} + \frac{n}{m}\gamma_{k_0}^2 \le 3\gamma_{k_0}$.

Therefore, plugging (3.4), (3.5) and (3.7) into (3.3) for $m = n\gamma_{k_0}$ gives

$\delta(\mathcal{P}_k, \mathcal{Q}) \le 2\gamma_{k_0}^{1/2}$.

3.7. The transition in the other direction. The other half of the deficiency
distance bound requires a map from the $(V, Y_{i,j})$ observations from
$\mathcal{Q}$ to the $X_{i,j}$ observations from $\mathcal{P}_k$. Once again, the top level observations
remain unchanged, $\hat X_{0,j} = Y_{0,j} \sim \mathcal{N}(\vartheta_{k_0,j}, V/n)$. The $Y_{i,j}$ for $i \ge k_0$ are not
used in the transformation; instead the $\hat X_{i,j}$ are functions of the variance
$V$. Because $V$ is a sufficient statistic for estimating $\sigma^2$ from $n$ independent
normals with mean 0, there is a probability distribution conditional on $V$
that is not a function of $\sigma$ of $n$ independent $\mathcal{N}(0, \sigma^2/n)$ random variables
from which we can use the first $n - m$ as $\hat X_{i,j}$.
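
One way to realize such a conditional distribution is sketched below in Python. It assumes that $V$ is identified with the sum of squares of the $n$ mean-zero normals, in which case the conditional law given $V$ is uniform on the sphere of radius $V^{1/2}$ and does not involve $\sigma$. The function name and the use of a normalized Gaussian vector are illustrative choices, not necessarily the construction intended in the proof.

```python
import math
import random

def normals_given_sum_of_squares(v, n, seed=0):
    """Draw a point uniformly on the sphere {x in R^n : sum x_i^2 = v}.

    If v is itself distributed as the sum of squares of n iid N(0, s2/n)
    variables, the returned coordinates are iid N(0, s2/n), whatever s2 is.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    scale = math.sqrt(v / sum(x * x for x in z))
    return [scale * x for x in z]

xs = normals_given_sum_of_squares(v=2.5, n=64)
```

The transformation uses only `v` and external randomness, so it is a valid randomization in the sense of Section 1.1.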

The $\hat X_{i,j}$ and $\hat X_{0,j}$ are not independent because they both depend on $V$,
but we can bound the K-L divergence via

$D(X, \hat X) = \sum_{k_0 \le i < k}\sum_j D(X_{i,j}, \hat X_{i,j}) + E\sum_j D(X_{0,j}, \hat X_{0,j} \mid \{\hat X_{i,j}\}_{k_0 \le i < k,\,j})$.

The contribution to the error from the second term is just as in (3.5), and
the first term is less than the error in (3.7). Therefore, $\delta(\mathcal{Q}, \mathcal{P}_k) \le 2\gamma_{k_0}^{1/2}$ and
Lemma 1 is established.

4. A variance function over the interval. An interesting extension of the 
result in Lemma 1 is to consider what happens when the variance changes 
over the interval. Our simplified version of this experiment assumes that we 
can group the basis functions into $m_1 = 2^{k_1}$ groups for $k_1 < k_0$,



and then each group of coefficients will have a different variance, $\sigma_\ell^2$. These
groups are chosen so that each Haar basis function $\psi_{i,j}$ with $(i,j) \in \mathcal{I}_\ell$ has
support in $(\ell - 1)/m_1 < t < \ell/m_1$. The variance of the group will be determined
by the variance function $\sigma^2(t)$ via

$\log\sigma_\ell^2 = m_1\int_{(\ell-1)/m_1}^{\ell/m_1} \log\sigma^2(t)\,dt$.

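Lemma 2. The parameter space $\mathcal{F}_\alpha \times \Sigma$ contains the mean functions
$f(t) \in \Theta(\alpha, 2, 2, \gamma_k\bar\sigma)$ for some $\alpha > 3/4$, and the variance functions $\sigma^2(t)$ are
such that $\sup_\Sigma \max_\ell |\log\sigma_\ell^2 - \log\bar\sigma^2| \le M$.

The experiment $\mathcal{P}_k$ observes $n$ independent normals,

$X_{i,j} \sim \mathcal{N}(\theta_{i,j}, \sigma_\ell^2/n)$ for $(i,j) \in \mathcal{I}_\ell$, $k_0 \le i < k$.

The $\mathcal{Q}$ experiment observes $m_1$ independent $V_\ell \sim \Gamma\big(\frac{n}{2m_1}, \frac{2\sigma_\ell^2 m_1}{n}\big)$ and

$Y_{0,j} \mid V \sim \mathcal{N}(\vartheta_{k_0,j}, V_\ell/n)$ for $(k_0, j) \in \mathcal{I}_\ell$,
$Y_{i,j} \mid V \sim \mathcal{N}(\theta_{i,j}, V_\ell/n)$ for $(i,j) \in \mathcal{I}_\ell$, $i \ge k_0$,

where the normals are all conditionally independent.
Then for $m_0 = 2^{k_0}$,

$\Delta(\mathcal{P}_k, \mathcal{Q}) \le 2m_1^{1/2} m_0^{1/2} n^{-1/2} + e^{M/2} m_0^{-\alpha} n^{1/2}\gamma_{k_0}$.

This bound is of order $\gamma_{k_0}$ when $m_0 = n^{1/(2\alpha)}$ and $m_1 = n^{1 - 1/(2\alpha)}\gamma_{k_0}^2$.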

The basic idea is that the argument for Lemma 1 can be repeated on each
of the $m_1$ independent pieces of this experiment. A Haar basis is used here
because the basis functions have disjoint support, which keeps things tidy as
the variance changes over the interval.

Comparing this to Lemma 1, the approximate sufficient statistics are the
$m_0$ random variables $Y_{0,j}$ and the $m_1$ variances $V_\ell$, where $m_0 < m$ from
Lemma 1 because the $f$ are smoother and $m_1 < m_0$ because the variance
functions are smoother still.

4.1. Proof of Lemma 2. The transformation of the $X_{i,j}$ follows as in
Section 3 on each of the $m_1$ pieces. First, there are estimates of the variances
based on the observations for $i \ge k_0$,

$\hat V_\ell = \frac{n m_1}{n - m_1}\sum_{\{(i,j) \in \mathcal{I}_\ell,\ i \ge k_0\}} X_{i,j}^2$.

Then new Gaussian observations for $i \ge k_0$ are generated by independent
normals with variance $\hat V_\ell/n$.

The error in the approximation is bounded in the same three stages: first
the error in the estimation of the variance, then the difference between the
distributions when $i = 0$, and finally the distance between the distributions
of the observations for $i \ge k_0$,

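(4.1) $D(\mathcal{Q}, K(\mathcal{P})) = \sum_{\ell=1}^{m_1} D(V_\ell, \hat V_\ell) + \sum_{\ell=1}^{m_1}\sum_{\{j:(k_0,j) \in \mathcal{I}_\ell\}} E[D(Y_{0,j}, X_{0,j} \mid V_\ell)]$
$\qquad + \sum_{\ell=1}^{m_1}\sum_{i \ge k_0}\sum_{\{j:(i,j) \in \mathcal{I}_\ell\}} E[D(Y_{i,j}, \hat Y_{i,j} \mid V_\ell)]$.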

Each of the estimates $\hat V_\ell$ is $m_1\sigma_\ell^2/(n - m_1)$ times a noncentral $\chi^2$ with
$(n - m_1)/m_1$ degrees of freedom and noncentrality parameter

$\mu_\ell = \sum_{k_0 \le i < k}\sum_{\{j:(i,j) \in \mathcal{I}_\ell\}} \frac{n\theta_{i,j}^2}{\sigma_\ell^2}$,

which is small for large $k_0$. By (9.6), the divergence between the distributions
of $V_\ell$ and $\hat V_\ell$ is

D(^^)<-^ + ^M±^). 
2(n — miY n 

Using independence, the divergence between the distributions of the entire 
vectors is just the sum, 

m 3 q / m \ 

(4.2) ED^7 4 )<^ + ^ (5> . 

These terms will turn out to be negligible relative to the errors in the other 
two terms. 

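The second term in (4.1) is the divergence between the conditional
distributions of $X_{0,j}$ and the $Y_{0,j}$. By (10.1) and Jensen's inequality,

$E[D(Y_{0,j}, X_{0,j} \mid V_\ell)] = \frac{1}{2}E\Big[\frac{V_\ell}{\sigma_\ell^2} - 1 - \log\frac{V_\ell}{\sigma_\ell^2}\Big] \le \frac{m_1}{n - 2m_1}$.

Thus the bound on the sum is

(4.3) $\sum_{\ell=1}^{m_1}\sum_{\{j:(k_0,j) \in \mathcal{I}_\ell\}} E[D(Y_{0,j}, X_{0,j} \mid V_\ell)] \le \frac{m_1 m_0}{n - 2m_1}$.

Finally, the third term in (4.1) is the divergence between the conditional
distributions of the $Y_{i,j}$'s and the $\hat Y_{i,j}$'s. By (10.1),

$\sum_{\ell=1}^{m_1}\sum_{i \ge k_0}\sum_{\{j:(i,j) \in \mathcal{I}_\ell\}} E[D(Y_{i,j}, \hat Y_{i,j} \mid V_\ell)] = \sum_{\ell=1}^{m_1}\sum_{i \ge k_0}\sum_{\{j:(i,j) \in \mathcal{I}_\ell\}} \frac{n\theta_{i,j}^2}{2}E[V_\ell^{-1}]$.

In this case, $E V_\ell^{-1} = \sigma_\ell^{-2}(1 - 2m_1/n)^{-1} \le e^M\bar\sigma^{-2}$ by the smoothness condition
in Lemma 2. Thus

$\sum_{\ell=1}^{m_1}\sum_{i \ge k_0}\sum_{\{j:(i,j) \in \mathcal{I}_\ell\}} E[D(Y_{i,j}, \hat Y_{i,j} \mid V_\ell)] \le e^M\sum_{i \ge k_0}\sum_j \frac{n\theta_{i,j}^2}{\bar\sigma^2}$.

Here we will use the smoothness properties of the function space to get the
bound

(4.4) $n\sum_{i \ge k_0}\sum_j \frac{\theta_{i,j}^2}{\bar\sigma^2} \le \frac{n}{m_0^{2\alpha}}\sum_{i \ge k_0}\sum_j 2^{2i\alpha}\frac{\theta_{i,j}^2}{\bar\sigma^2} \le \frac{n}{m_0^{2\alpha}}\gamma_{k_0}^2$.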

Therefore, plugging in (4.2), (4.3) and (4.4) into (4.1) yields 
D(Q,FiO<^ + e-^ 7fc 2 , 

n 777,0 u 

and the deficiency is

$\delta(\mathcal{P}_k, \mathcal{Q}) \le 2m_1^{1/2} m_0^{1/2} n^{-1/2} + e^{M/2} n^{1/2} m_0^{-\alpha}\gamma_{k_0}$.

4.1.1. The transformation in the other direction. The proof of Lemma 2
is completed by bounding the deficiency in the other direction. Following
very much what we did in Section 3.7, each $V_\ell$ can be decomposed into
$n/m_1$ independent normals with mean 0 and variance $\sigma_\ell^2/n$ to create a set of
observations $\hat X_{i,j}$ (conditional on $V_\ell$) for $(i,j) \in \mathcal{I}_\ell$ and $k_0 \le i < k$.

The divergence between these distributions and the distributions in $\mathcal{P}_k$ is
less than

$\sum_{\ell=1}^{m_1}\sum_{k_0 \le i < k}\sum_{\{j:(i,j) \in \mathcal{I}_\ell\}} D(X_{i,j}, \hat X_{i,j}) + \sum_{\ell=1}^{m_1}\sum_{\{j:(k_0,j) \in \mathcal{I}_\ell\}} E\,D(X_{0,j}, \hat X_{0,j} \mid V_\ell)$,

where the first term is bounded as in (4.4), and the second term as in (4.3).
Therefore,

6(? k , Q) < mT^n-y + e M /Wn^ lk0 
and the proof of Lemma 2 is finished.

5. The difference between the sequence and sampled experiments. The
sequence results in Lemmas 1 and 2 assume that the observations are $n$
wavelet coefficients. This is unrealistic, and we would prefer that the
observations were of the form (1.1). The standard technique is to use $Y_\ell/\sqrt{n}$ as
approximations to the scaling coefficients at the lowest level in the wavelet
expansion, $\vartheta_{k,\ell}$.

The cascade algorithm of [18] constructs higher frequency scaling function
coefficients from the scale function and wavelet coefficients. This construction
can be used to generate observations distributed as $\mathcal{N}(n^{1/2}\vartheta_{k,\ell}, \sigma^2)$ from the $Y_{i,j}$, and the
construction can be inverted to construct the wavelet coefficients from these
scaling function coefficients at level $k$.
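
For the Haar basis the cascade step and its inverse are short enough to write out. The Python sketch below is an illustration under the orthonormal Haar convention in which the wavelet is positive on the first half of its support; the function names are hypothetical and are not taken from [18].

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_analysis_step(scaling):
    """Split level-(i+1) scaling coefficients into level-i scaling and wavelet coefficients."""
    half = len(scaling) // 2
    coarse = [(scaling[2 * j] + scaling[2 * j + 1]) / SQRT2 for j in range(half)]
    detail = [(scaling[2 * j] - scaling[2 * j + 1]) / SQRT2 for j in range(half)]
    return coarse, detail

def haar_synthesis_step(coarse, detail):
    """Cascade step: rebuild level-(i+1) scaling coefficients from level i."""
    fine = []
    for s, d in zip(coarse, detail):
        fine.append((s + d) / SQRT2)
        fine.append((s - d) / SQRT2)
    return fine

def haar_decompose(scaling, k0):
    """Apply analysis steps down to 2**k0 scaling coefficients; details ordered coarsest first."""
    details = []
    current = list(scaling)
    while len(current) > 2 ** k0:
        current, detail = haar_analysis_step(current)
        details.insert(0, detail)
    return current, details

def haar_reconstruct(coarse, details):
    """Invert haar_decompose by repeated synthesis steps."""
    current = list(coarse)
    for detail in details:
        current = haar_synthesis_step(current, detail)
    return current

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
top, wavelets = haar_decompose(ys, k0=1)
rebuilt = haar_reconstruct(top, wavelets)
```

Each step is an orthogonal transformation, so independent normals with a common variance are mapped to independent normals with the same variance, which is what allows the construction to be run in either direction.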

The error in this approximation is in the difference between the means, 

To be concrete, take the orthonormal basis to be the Haar basis. For $f$ a
continuous function,

(5.1) $f(\ell/n) = n^{1/2}\vartheta_{k,\ell} + \sum_{i \ge \log_2 n} \theta_{i,j^*} 2^{(i-1)/2}$,

where the $j^*$'s are the indices of the wavelets such that $|\psi_{i,j^*}(\ell/n)| > 0$.
The K-L divergence bound becomes 



=1 \ i>log 2 n 



(5-2) <£( E 2^)/ 2 sup|M) 



<\ E 2 l su P 

^ i>log 2 n 3 




for mean functions in $\Theta(1/2, \infty, 1, \gamma_k\sigma)$. This function space is different from
that required in Lemma 1 but still includes any Hölder space for $\alpha > 1/2$ as in
[2].

Theorem 1 is proven by first appealing to the triangle inequality for A, 
A ("P, Q) < A(V,V) + A(P, Q). Then, for the parameter space that is the 
intersection of the two spaces required for Lemma 1 and (5.2), A(P,P) < 7^ 
from (5.2) and A(P, Q) < 2M 1 / 2 7fco by Lemma 1. Therefore, A(P, Q) < 
3M 1 / 2 7fc —> 0, and Theorem 1 is established. 
6. Proving Theorem 2. There are three asymptotic results that can be
combined to establish Theorem 2. The first is Lemma 2, which showed

$\Delta(\mathcal{P}_k, \mathcal{Q}) \le e^{M/2} n^{1/2} m_0^{-\alpha}\gamma_{k_0} + 2m_1^{1/2} m_0^{1/2} n^{-1/2}$.

We need two more approximations. First, the $\mathcal{P}_k$ experiment needs to be
approximated by the nonparametric regression experiment.

Lemma 3. We have

$$\Delta(\mathcal{P}_k, \mathcal{P}) \le 2e^{M/2}\left(n^{1/2} m_0^{-\alpha}\gamma_{k_0} + n^{1/2} m_1^{-\alpha_1} + m_0^{1/2} m_1^{-1} + n^{1/2} m_0^{-1} + n^{1/2} m_1^{-3/2}\right).$$

This is proven in Section 7.

Finally, we need to approximate $\bar{\mathcal{Q}}$ by a continuous Gaussian process.

Lemma 4.

$$\Delta(\bar{\mathcal{Q}}, \mathcal{Q}) \le m_1 n^{-1/2} + 2M n^{1/2} m_1^{-\alpha_1} + M n^{1/2} m_1^{-3/2}.$$

This result is shown in Section 8.

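These three results can be combined using the triangle inequality for the
$\Delta$ distance to prove that

$$\Delta(\mathcal{P}, \mathcal{Q}) \le C n^{1/2} m_0^{-\alpha}\gamma_{k_0} + C n^{1/2} m_1^{-\alpha_1} + C m_0^{1/2} m_1^{1/2} n^{-1/2} + C m_0^{1/2} m_1^{-1} + \cdots.$$

Let $c_0 = \log_n m_0$ and $c_1 = \log_n m_1$ so that if

$$c_0 \ge \frac{1}{2\alpha}, \quad c_1 > \frac{1}{2\alpha_1}, \quad c_0 + c_1 < 1 \quad\text{and}\quad c_0 < 2c_1,$$

then $\Delta(\mathcal{P}, \mathcal{Q}) \to 0$. The conditions can only be fulfilled if $\alpha > 3/4$ and
$\alpha_1 > \alpha/(2\alpha - 1)$. Of course the argument assumes all along that $\alpha < 1$
and $\alpha_1 > 1$. For $3/4 < \alpha < 1$ and $1 < \alpha_1 < 3/2$, we could take
$m_0 = n^{1/(2\alpha)}$ and $m_1 = n^{1/(2\alpha_1) + \varepsilon}$ with
$\varepsilon = \frac{1}{2} - \frac{1}{4}\left(\frac{1}{\alpha} + \frac{1}{\alpha_1}\right)$, so that

$$n^{1/2} m_0^{-\alpha}\gamma_k + n^{1/2} m_1^{-\alpha_1} + m_0^{1/2} m_1^{1/2} n^{-1/2} + m_0^{1/2} m_1^{-1} = \gamma_k + n^{-\varepsilon\alpha_1} + n^{-\varepsilon/2} + n^{1/(4\alpha) - 1/(2\alpha_1) - \varepsilon},$$

which goes to 0 as $n \to \infty$ because $\varepsilon > 0$ and $4\alpha > 2\alpha_1$. The other
terms in the bound on $\Delta(\mathcal{P}, \mathcal{Q})$ are also negligible,

$$n^{1/2} m_0^{-1} + n^{1/2} m_1^{-3/2} = n^{(\alpha-1)/(2\alpha)} + n^{(2\alpha_1-3)/(4\alpha_1)}\, n^{-3\varepsilon/2},$$

which goes to 0 because $\alpha < 1$ and $\alpha_1 < 3/2$. Finally, if $n^{1/2} m_0^{-1} \to 0$ and
$m_0 m_1 n^{-1} \to 0$ then clearly $m_1 n^{-1/2} \to 0$.

Therefore, Lemmas 2, 3 and 4 together are sufficient to prove Theorem 2
where the sequence $m_n = n^{1/(2\alpha_1) + \varepsilon}$.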
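The exponent bookkeeping in this section is easy to check mechanically. The following sketch uses exact rational arithmetic to compute the power of n carried by each term of the bound when $m_0 = n^{1/(2\alpha)}$ and $m_1 = n^{1/(2\alpha_1)+\varepsilon}$. The function name, the dictionary keys, and the choice $\varepsilon = 1/2 - (1/\alpha + 1/\alpha_1)/4$ are assumptions of this sketch; that value of $\varepsilon$ is the one that makes the $m_0^{1/2} m_1^{1/2} n^{-1/2}$ term equal to $n^{-\varepsilon/2}$.

```python
from fractions import Fraction

def rate_exponents(alpha, alpha1):
    """Exponent of n for each term, with m0 = n^c0 and m1 = n^c1."""
    a, a1 = Fraction(alpha), Fraction(alpha1)
    half = Fraction(1, 2)
    eps = half - (1 / a + 1 / a1) / 4   # assumed choice of epsilon
    c0 = 1 / (2 * a)                    # log_n m0
    c1 = 1 / (2 * a1) + eps             # log_n m1
    return {
        "eps": eps,
        "n^{1/2} m1^{-alpha1}": half - a1 * c1,
        "m0^{1/2} m1^{1/2} n^{-1/2}": c0 / 2 + c1 / 2 - half,
        "m0^{1/2} m1^{-1}": c0 / 2 - c1,
        "n^{1/2} m0^{-1}": half - c0,
        "n^{1/2} m1^{-3/2}": half - 3 * c1 / 2,
        "m1 n^{-1/2}": c1 - half,
    }

# Example values inside the admissible range 3/4 < alpha < 1, 1 < alpha1 < 3/2.
exps = rate_exponents(Fraction(9, 10), Fraction(13, 10))
for name, value in exps.items():
    print(name, value)
```

Every exponent other than `eps` should come out negative for an admissible pair; the term $n^{1/2} m_0^{-\alpha}\gamma_k$ has exponent zero and vanishes only through $\gamma_k$.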
7. Sequence space result for changing variances. Lemma 3 compares the
experiment $\mathcal{P}_k$ which observes n wavelet coefficients to the experiment $\mathcal{P}$
which observes n normals with means f(i/n) and variances $\sigma^2(i/n)$. The
approximation will be established in a series of steps by establishing
intermediary experiments that are equivalent to both experiments.

7.1. Negligible wavelet means. The first approximation of $\mathcal{P}_k$ is by $\mathcal{P}_1^*$,
which observes the $X_{0,j}$ the same as in $\mathcal{P}_k$ but observes $X_{i,j}^*$ that have
expectation zero. The divergence between the joint distributions is

m&ojMXij})* ({*o,i}, {x*,})) =nj2Y, e -k> 

i>k j G L 

which is the same as the bound in (4.4), and therefore it goes to 0 for n
large and

(7.1) $\Delta(\mathcal{P}_k, \mathcal{P}_1^*) \le e^{M/2} n^{1/2} m_0^{-\alpha}\gamma_{k_0}$.

A sequence of zero-mean normals with an unknown variance has the sum
of the squared observations as a sufficient statistic. In particular, $\mathcal{P}_1^*$ is
equivalent to observing

$$V_\ell \sim \Gamma\left(\frac{n - m_0}{2m_1}, \frac{2m_1\sigma_\ell^2}{n - m_0}\right)$$

for $\ell = 1, \ldots, m_1$, where

$$V_\ell = \frac{n m_1}{n - m_0}\sum X_{i,j}^{*2}.$$

7.2. Distributing the variances. Next, we consider the experiment $\mathcal{P}_2^*$
that has $m_0$ variances instead of $m_1$. It has observations

' J \ n J J \ 2mo n — moJ 

all independent for $j = 1, \ldots, m_0$. The new variances are

$$\log\sigma_j^{*2} = [\zeta_{\ell,j}\log\sigma_\ell^2 + \zeta_{\ell+1,j}\log\sigma_{\ell+1}^2]$$

for $(2\ell - 1)/2m_1 < j/m_0 \le (2\ell + 1)/2m_1$ where $\zeta_{\ell,j} + \zeta_{\ell+1,j} = 1$ are weight
functions defined below.

This experiment is generated by smoothing out the variance information
in the $V_\ell$ of $\mathcal{P}_1^*$. The transformation of the observations from $\mathcal{P}_1^*$ leaves the
observations $X_{0,j}$ alone and uses the $V_\ell$ to generate the $\chi^2$ random variables.
The trick is to redistribute this information in a smooth way to produce the
$V_j^*$.
o 
7.2.1. The transformation of the $V_\ell$. We decompose each $V_\ell$ into $2m_0/m_1$
gamma observations (with a small correction for $\ell = 1$ and $\ell = m_1$). The
technique depends on the fact that a random variable $X \sim \Gamma(\alpha, \beta)$ times an
independent beta random variable $\xi \sim B(\delta\alpha, [1 - \delta]\alpha)$ is $X\xi \sim \Gamma(\delta\alpha, \beta)$
and is independent of $X(1 - \xi) \sim \Gamma([1 - \delta]\alpha, \beta)$. This can be extended to a
multivariate beta distribution. For j between $(2\ell - 3)m_0/(2m_1)$ and
$(2\ell + 1)m_0/(2m_1)$ and parameters $\delta_{\ell,j}$ such that $\sum_j \delta_{\ell,j} = 1$, the density of
$\xi_\ell$ on the simplex $\sum_j \xi_{\ell,j} = 1$ is

$$f(\xi) = \frac{\Gamma\left(\frac{n - m_0}{2m_1}\right)}{\prod_j \Gamma\left(\delta_{\ell,j}\frac{n - m_0}{2m_1}\right)} \prod_j \xi_{\ell,j}^{\delta_{\ell,j}(n - m_0)/(2m_1) - 1}$$

so that the $\xi_{\ell,j}V_\ell$ are independent gamma random variables with
$\alpha = \delta_{\ell,j}((n - m_0)/2m_1)$.
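The gamma-beta splitting fact used above can be checked by simulation. This is a minimal sketch using only the standard library; the parameter values, seed, and function names are illustrative assumptions. It draws $X \sim \Gamma(\alpha, \beta)$ (shape, scale) and an independent $\xi \sim B(\delta\alpha, [1-\delta]\alpha)$, then compares the sample mean and variance of $X\xi$ with those of $\Gamma(\delta\alpha, \beta)$ and reports the sample correlation between $X\xi$ and $X(1-\xi)$, which should be near zero.

```python
import math
import random

def split_gamma(alpha, beta, delta, n_draws=20000, seed=0):
    """Draw pairs (X * xi, X * (1 - xi)) with X ~ Gamma(alpha, scale=beta)
    and an independent xi ~ Beta(delta * alpha, (1 - delta) * alpha)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_draws):
        x = rng.gammavariate(alpha, beta)
        xi = rng.betavariate(delta * alpha, (1 - delta) * alpha)
        pairs.append((x * xi, x * (1 - xi)))
    return pairs

def summarize(pairs):
    """Sample mean and variance of the first part, and the correlation
    between the two parts."""
    n = len(pairs)
    mean_a = sum(p[0] for p in pairs) / n
    mean_b = sum(p[1] for p in pairs) / n
    var_a = sum((p[0] - mean_a) ** 2 for p in pairs) / n
    var_b = sum((p[1] - mean_b) ** 2 for p in pairs) / n
    cov = sum((p[0] - mean_a) * (p[1] - mean_b) for p in pairs) / n
    return mean_a, var_a, cov / math.sqrt(var_a * var_b)

mean_a, var_a, corr = summarize(split_gamma(alpha=8.0, beta=1.5, delta=0.25))
# Gamma(delta * alpha, beta) has mean delta*alpha*beta and variance delta*alpha*beta^2.
print(mean_a, var_a, corr)
```

Zero correlation does not by itself establish independence; the simulation is only a sanity check on the moments.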

The variance terms for V 1 ^ are constructed via 



V* 



m 



+ for 



2f 



1 j 

< — < 

2?rii mo 



2£+l 



mi " ~imi mo 2m i 

which is a sum of gamma random variables. The weighting parameters are 



mi + \ 

m 



2j-l 



mi 



1 



m 



1 



mi 



m 



m 



2£- 1 



mi 
m 



mi 
m 



Thus ^6 ,Vi ~ rf^O 7, ^L). 

mi st iJ 1 \ 2mo n— mo ' 

On the edges of the interval, for $j/m_0 \le 1/2m_1$ and $j/m_0 > 1 - 1/2m_1$,
the weights are simply $\delta_{1,j} = \delta_{m_1,j} = m_1/m_0$. There is no smoothing on the
edges of the interval.

This is a somewhat involved transformation, and the divergence between 
the generated observations and the observations is 



mo 



'»() 



(7.2) 



2jD(X 0i: ,-,Xoj y 



mo 



mi 



[& d V t + ti i+ ijV i+1 },V* 



>)+E D l 

i=i j=i 

The first term can be bounded by noting that the logarithm of the variance 
function has a derivative bounded by M, and thus 



loga| — log erf | 



(7.3) 

which, along with (10.1), implies 



,,2 2, M 

O+i ,j I logo> + i - logo> | < , 

mi 
Thus the total error from the first term in (7.2) is less than $m_0 M^2 m_1^{-2}$.
The contribution to the error from the edges is zero because $\sigma_j^{*2} = \sigma_\ell^2$ for
$j/m_0 \le 1/2m_1$ or $j/m_0 > 1 - 1/2m_1$.

For the second term in (7.2), $V_j^*$ is a sum of gammas with different scale
terms. We need a general lemma on the distribution of gamma sums.

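Lemma 5. For independent $X_1, \ldots, X_m$ random variables with $X_i \sim \Gamma(\delta_i n, \sigma_i^2/n)$
where the $\delta_i > 0$ and $\sum_i \delta_i = 1$, the distribution of the sum
of the $X_i$s is approximately gamma,

$$\sum_{i=1}^m X_i \approx Y \sim \Gamma\left(n, \frac{\bar\sigma^2}{n}\right) \quad\text{where } \bar\sigma^2 = \prod_{i=1}^m \sigma_i^{2\delta_i}.$$

For $r_i = \log\sigma_i^2 - \log\bar\sigma^2$, the divergence between the distributions is bounded
as $r_i \to 0$ by

$$D\left(\sum_{i=1}^m X_i, Y\right) \le \sum_{i=1}^m \left[\frac{n\delta_i r_i^4}{8} + \frac{r_i^2}{4} + O\left(n\delta_i|r_i|^5 + |r_i|^3 + \delta_i^{-1} r_i^2 n^{-1}\right)\right].$$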
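A numerical check can make the size of the Lemma 5 bound concrete. The sketch below computes the exact sum of gamma K-L divergences $\sum_i D(X_i, X_i^*)$, with $X_i^* \sim \Gamma(\delta_i(1 + r_i)n, \bar\sigma^2/n)$ as in the proof in Section 11, and compares it with the leading terms $\sum_i (r_i^2/4 + n\delta_i r_i^4/8)$. Those constants are this sketch's reading of the expansion, obtained by expanding the exact gamma divergence; the helper names and the example values are assumptions, and the digamma function is approximated by a standard recurrence plus asymptotic series.

```python
import math

def digamma(x):
    """Digamma psi(x) for x > 0: shift up to x >= 6, then asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def kl_gamma(a1, b1, a2, b2):
    """K-L divergence of Gamma(a1, scale b1) from Gamma(a2, scale b2)."""
    return ((a1 - a2) * digamma(a1) - math.lgamma(a1) + math.lgamma(a2)
            + a2 * math.log(b2 / b1) + a1 * (b1 - b2) / b2)

def lemma5_check(n, deltas, variances):
    """Return (exact joint divergence, leading-term approximation)."""
    log_bar = sum(d * math.log(v) for d, v in zip(deltas, variances))
    var_bar = math.exp(log_bar)  # weighted geometric mean of the variances
    exact = approx = 0.0
    for d, v in zip(deltas, variances):
        r = math.log(v) - log_bar
        exact += kl_gamma(d * n, v / n, d * (1 + r) * n, var_bar / n)
        approx += r * r / 4 + n * d * r ** 4 / 8
    return exact, approx

exact, approx = lemma5_check(400, (0.5, 0.5), (math.exp(0.1), math.exp(-0.1)))
print(exact, approx)
```

The joint divergence is an upper bound on the divergence between the sums, so agreement between the two numbers only confirms the expansion used in the proof, not the sharpness of the lemma.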

The proof of this lemma is in Section 11. For ri = log(<7 2 /<rJ), Lemma 5 
implies 

D ^*' W * (^) (to 4 + Wti) + ±L t ±1 + • • • • 

Using (7.3) once again, the bounds on the divergences are 

dcv?, v7> < ( =ps) ^ + £ + o ( „™- + ™r 3 ). 

J J V 2mo / mf 2mf 

On the edges the relationship is exact, $V_j^* = \xi_{\ell,j}V_\ell$ for $j/m_0 \le 1/2m_1$ and
$j/m_0 > 1 - 1/2m_1$ where $\ell = 1$ and $\ell = m_1$, respectively. Therefore, the
second term in (7.2) is bounded,



m m -m /2mi 4 2 

(7.5) 



< M 4 nmj~ 4 + M 2 m m^ 2 . 



Putting the two divergence bounds from (7.4) and (7.5) into (7.2) gives
(7.6) $\delta(\mathcal{P}_1^*, \mathcal{P}_2^*) \le 2M m_0^{1/2} m_1^{-1} + M^2 n^{1/2} m_1^{-2}$.
7.2.2. The transformation of the $V_j^*$. To reproduce the $\mathcal{P}_1^*$ random
variables from the observation of $\mathcal{P}_2^*$, the chi-squared random variables $V_j^*$ are
added together to generate approximately the $V_\ell$.

As before, the transformation leaves the $X_{0,j}$ alone and the difference in
the distributions of the $X_{0,j}^*$ and $X_{0,j}$ contributes an $O(m_0/m_1^2)$ term to the
bound.

The Vi can be approximated by the sum of V* in the set Jt = {j : < 

< ^"}- By Lemma 5, the sum of the ^JVj* will be approximately a 
gamma with expectation 

m (<~l/2)/mi 

log erf = ^ (Ce-i J log erf _ ! + Qj log erf ) 

j=ni (£-l)/m 1 +l 

mo£/mi 

+ X (6 J lo § °i + Ce+i j lo § ) > 

j=mo(£-l/2)/mi+l 

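which equals $\frac{1}{8}\log\sigma_{\ell-1}^2 + \frac{3}{4}\log\sigma_\ell^2 + \frac{1}{8}\log\sigma_{\ell+1}^2$. The correction at the edges
implies that $\log\bar\sigma_1^2 = \frac{7}{8}\log\sigma_1^2 + \frac{1}{8}\log\sigma_2^2$ and
$\log\bar\sigma_{m_1}^2 = \frac{1}{8}\log\sigma_{m_1-1}^2 + \frac{7}{8}\log\sigma_{m_1}^2$.

To the error bounded in Lemma 5, we will have to add the error from the
difference between $\bar\sigma_\ell^2$ and $\sigma_\ell^2$. Let $g_\sigma(x)$ be the density of a gamma
distribution with $\alpha = (n - m_0)/2m_1$ and $\beta = 2m_1\sigma^2/(n - m_0)$. The divergence
can then be written as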

(7.7) 

+ Elog 5CT;(mi/m ° Eie ^ yj * ) 



g 0t {mi/mo Hjaj t V j 

where is the gamma random variable with density g a * , 
Let Vj = log &j /erf . Then by Lemma 5, 

where the are bounded for (£— l)/m\ < j/mo < (2£ — l)/2m\ using (7.3), 

M 



-log [Qj- - log-/ 



< 
m\ 



For (2£ — X)j2m\ < j/mo < ljm\ % there is an analogous calculation so that 
\rj\ < Mjm\ for every j. For j/m < l/2m\, the \rj\ = | log erf jo\ |/8 < 
M/m\. There is an analogous bound on the other end of the interval as 
well. 
Thus
(7.8) 



M 4 (n-m ) M 2 m _ 3 
H : — — + O(m m 1 °) 



16m| 



4m, 2 



The second term in (7.7) is the expected value of 

.2 / x 



5<r; (X) n 

log 



m 



9a AX) 



2mi 



cr 



(77 



The expectation is to be taken over the distribution of the average of the 
V*, 



E 



mi 
m 



mi 



m 



m 



T t+l 



Section 13 does the necessary calculation to bound the contribution from 
this expectation. Prom (13.2) and (13.3), 



nil 



E E1 °g 



2 mi— 1 



9a; (X) _ 3 M 

— —, — - < M^nm, d H > 

9a e ( X ) ~ mi ^ 



nm. 



-2ai 



<=2 



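Putting this together with the bound in (7.8),

$$\delta(\mathcal{P}_2^*, \mathcal{P}_1^*) \le M n^{1/2} m_1^{-\alpha_1} + M m_0^{1/2} m_1^{-1} + M n^{1/2} m_1^{-3/2}.$$

Therefore, in light of the analogous result in (7.6),

(7.9) $\Delta(\mathcal{P}_2^*, \mathcal{P}_1^*) \le M n^{1/2} m_1^{-\alpha_1} + 2M m_0^{1/2} m_1^{-1} + M n^{1/2} m_1^{-3/2}$.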

7.3. Approximating the nonparametric regression. The last step is to
show that $\mathcal{P}_2^*$ is equivalent to the n independent normal observations from
the original $\mathcal{P}$ experiment.

The $V_j^*$ observations are sufficient statistics for $(n - m_0)/m_0$ independent
normals with mean 0 and variances $\sigma_j^{*2}$. These normals will be used in
place of all the wavelet coefficients $X_{i,j}$. These wavelet coefficients are
combined with the $X_{0,j}$ and, using Mallat's algorithm, n normal observations
are produced, $Y_i^* \sim N(\sqrt{m_0}\,\theta_{k_0,j}, \sigma_j^{*2})$ where $(j - 1)/m_0 < i/n \le j/m_0$.
The error made by this approximation is

Y>(Y*,Yi)- ~~a 2 (i/n) n ,_{o- 2 (i/nY 



j-fji/n)) 2 , 1 
2a? + 2 



a 



1 - log 



j 



a 



,) 



The mean functions are bounded much as with the constant variance case, 

v 2 / „ \ 2 



(7.10) 



E 



(/(*/") ~ y/fnoOkoj) 



< e 



M 



0~ ; 



n 



2 i («+ 1 /2) 

. i>ko 



sup 

3 



'i,3 



a 



< e 



M 



11 



m 



2a %o ' 
because the partial sums of the $b(\alpha, \infty, 1)$ norm are uniformly bounded for
/ G fia(ct,^fk) and using the smoothness of the variances for a 2 < e M a 2 . 
The second part of the divergence is 



n , 

Y- 



a 2 {i/n) 



1 -loe 



0"7 



a 2 (i/n) 



(71 



<^](logc7 2 (i/ra) -loga 2 



2\2 



i=l 



To bound this quantity requires taking advantage of all the smoothness in
the variance functions. To simplify things a bit, we write $\tau(t) = \log\sigma^2(t)$
and $t_j^* = (2j - 1)/(2m_0)$. The smoothness condition on $\tau$ implies that
$\tau(t) = \tau(t^*) + (t - t^*)\tau'(t^*) + E$ where the error term is $|E| \le M|t - t^*|^{\alpha_1}$. By the
definition of $\sigma_j^{*2}$,



log = mi 



Q,jT(t) + Q + ijT(t + l/mi)dt 



+ m 1 r'{t*] 
r(t*) 



i/mi 
(£-l)/m 



[Qj(t -t*) + Q+i,j(t - t* + 1/mi)] dt + E x 



+ r'(t*) 



(21-1 2j-l 



V 2m\ 



2mc] 



+ Cm, 



21+1 2j-l 



2?7li 



2mc] 



+ E X . 



The error is an average over errors in the expansion and so $|E_1| \le M m_1^{-\alpha_1}$.
Plugging in $\zeta_{\ell,j}$ and $\zeta_{\ell+1,j}$ according to their definitions,



0: 



21 - 1 2j - 1 



2mi 



2mo 



+ 1 2j-l 
2mo 



2mi 



mi 



mi 



Thus, $\log\sigma_j^{*2} = \tau(t_j^*) + E_1$, and, from the bound on the derivative,
$|\tau(i/n) - \tau(t_j^*)| \le M m_0^{-1}$ implies that the bound is



(7.11) 



(logo- 2 (i/n) -logcr 



nl/2 



2^,2 
3' 



i=l 



< Mn x l 2 m{ ai + Mn 1 



/2 



m r 



Combining (7.11) and (7.10) implies

(7.12) $\Delta(\mathcal{P}_2^*, \mathcal{P}) \le e^{M/2} n^{1/2} m_0^{-\alpha}\gamma_{k_0} + M n^{1/2} m_1^{-\alpha_1} + M n^{1/2} m_0^{-1}$.

It is not necessary to do a specific calculation to bound $\delta(\mathcal{P}, \mathcal{P}_2^*)$ because
Mallat's algorithm is invertible, and the deficiency distance between the
distributions after applying the inverse to both distributions can only be
smaller.

Putting together (7.1), (7.9) and (7.12),

$$\Delta(\mathcal{P}_k, \mathcal{P}) \le 2e^{M/2} n^{1/2} m_0^{-\alpha}\gamma_{k_0} + 2M n^{1/2} m_1^{-\alpha_1} + 2M m_0^{1/2} m_1^{-1} + M n^{1/2} m_0^{-1} + M n^{1/2} m_1^{-3/2},$$

which proves Lemma 3.

8. The variance process. The final piece of the proof of Theorem 2 is
to show that the variance observations in the simplified experiment can be
transformed into a continuous Gaussian process. The experiment $\bar{\mathcal{Q}}$
observes $m_1$ independent variance components $V_\ell$ and a countable sequence
of normal coefficients. Mimicking the results of [10] and [24], we can
construct an independent Gaussian process V(t) from the $\chi^2$ observations,
$dV(t) = \log\sigma^2(t)\,dt + \sqrt{2}\,n^{-1/2}\,dW_2(t)$. The construction follows by taking
the logarithm of the $V_\ell$ and then using them to approximate the increments
of V(t).

Taking the logarithm of the $V_\ell$ generates an intermediate experiment $\mathcal{Q}^*$
that observes $Z_\ell \sim N(\log\sigma_\ell^2, 2m_1/n)$ all independent and the Gaussian process,
conditional on the $Z_\ell$,

$$dY^*(t) = f(t)\,dt + e^{Z_\ell/2} n^{-1/2}\,dW(t) \quad\text{for } \frac{\ell - 1}{m_1} \le t < \frac{\ell}{m_1}.$$

The divergence between the distributions is

$$D((\log V, Y), (Z, Y^*)) = \sum_{\ell=1}^{m_1} D(\log V_\ell, Z_\ell) + D(Y, Y^* \mid \log V).$$

The first term is bounded as in Section 10.1 by $m_1^2/n$ and for the second
the divergence term $D(Y, Y^* \mid V) = 0$ because the conditional distributions
are the same.

In the other direction, the $V_\ell$ are approximated by $\exp[Z_\ell]$. The divergence
bounds are the same (these transformations are one-to-one and increasing),
thus

(8.1) $\Delta(\bar{\mathcal{Q}}, \mathcal{Q}^*) \le m_1 n^{-1/2}$.

The scaled increments $m_1[V(\ell/m_1) - V([\ell - 1]/m_1)]$ from $\mathcal{Q}$ have the
same distribution as the $Z_\ell$ from $\mathcal{Q}^*$. Thus, $\delta(\mathcal{Q}, \mathcal{Q}^*) = 0$.

To bound $\delta(\mathcal{Q}^*, \mathcal{Q})$ we need to construct the entire V(t) process via a
smoothing operation on the $Z_\ell$ that is described in detail in [3]. Our
argument follows this reference very closely, so only an outline of the steps
will be given.
The transformation uses triangular interpolating kernels

$$K_\ell(x) = m_1 - m_1^2\left|x - \frac{2\ell - 1}{2m_1}\right| \quad\text{for } \frac{2\ell - 3}{2m_1} \le x \le \frac{2\ell + 1}{2m_1}$$

with the appropriate reflections at the boundaries. The variable V(t) is
constructed from the $Z_\ell$,



1/1,1 -t 

dV*(t) = V ZtK t (t) + V2U' 1 / 2 V — = dB e (t), 

ti ti 

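where the $B_\ell(t)$ are independent $K_\ell$-Brownian bridges. This Gaussian
process is (see [3])

$$dV^*(t) = \bar\tau(t)\,dt + \sqrt{2}\,n^{-1/2}\,dW_2(t),$$

where $W_2(t)$ is a standard Brownian motion and $\bar\tau$ is a piecewise-linear
function where $\bar\tau(t_\ell^*) = \log\sigma_\ell^2$ for $t_\ell^* = \frac{2\ell - 1}{2m_1}$ and

(8.2) $\bar\tau(t) = m_1(t_{\ell+1}^* - t)\bar\tau(t_\ell^*) + m_1(t - t_\ell^*)\bar\tau(t_{\ell+1}^*)$

for $t_\ell^* \le t \le t_{\ell+1}^*$. For $t < \frac{1}{2m_1}$ or $t > 1 - \frac{1}{2m_1}$ the function $\bar\tau$ is just a constant
equal to $\bar\tau(\frac{1}{2m_1})$ or $\bar\tau(1 - \frac{1}{2m_1})$, respectively.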

The smoothness condition on the $\tau$ functions implies that

(8.3) $\bar\tau(t_\ell^*) = m_1\int_{(\ell-1)/m_1}^{\ell/m_1}\tau(x)\,dx = \tau(t) + (t_\ell^* - t)\tau'(t) + E_1$.

Likewise, $\bar\tau(t_{\ell+1}^*) = \tau(t) + (t_{\ell+1}^* - t)\tau'(t) + E_2$. Therefore, plugging (8.3)
into (8.2) yields $|\bar\tau(t) - \tau(t)| \le 2M m_1^{-\alpha_1}$ for t in the interior of the interval.
Unfortunately, at the boundaries the error is of the order $m_1^{-1}$. Thus, the
$L_2$ distance between $\bar\tau$ and $\tau$ is bounded by

(8.4) $\|\bar\tau - \tau\|_2^2 \le 4M^2 m_1^{-2\alpha_1} + M^2 m_1^{-3}$.

The total-variation distance between the distributions of $V^*(t)$ and $V(t)$
is of the order of this distance divided by the variance. Therefore, from (8.1)
and (8.4),

(8.5) $\Delta(\bar{\mathcal{Q}}, \mathcal{Q}) \le m_1 n^{-1/2} + 2M n^{1/2} m_1^{-\alpha_1} + M n^{1/2} m_1^{-3/2}$.

This proves Lemma 4.
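The interpolation step can be illustrated with a small numerical sketch. It builds the piecewise-linear function of (8.2) from cell averages of a log-variance function and measures how far it is from that function in the interior of [0, 1]. The test function sin(2πt), the grid size, and all function names are assumptions of this sketch, not choices made in the paper.

```python
import math

def cell_average(l, m1):
    """m1 times the integral of sin(2 pi x) over ((l-1)/m1, l/m1)."""
    lo, hi = (l - 1) / m1, l / m1
    return m1 * (math.cos(2 * math.pi * lo) - math.cos(2 * math.pi * hi)) / (2 * math.pi)

def tau_bar(t, m1):
    """Linear interpolation of the cell averages placed at the midpoints
    t_l = (2l - 1)/(2 m1); constant to the left of t_1 and right of t_m1."""
    first, last = 1 / (2 * m1), 1 - 1 / (2 * m1)
    if t <= first:
        return cell_average(1, m1)
    if t >= last:
        return cell_average(m1, m1)
    l = min(max(int(math.floor(t * m1 + 0.5)), 1), m1 - 1)  # t_l <= t < t_{l+1}
    t_l, t_next = (2 * l - 1) / (2 * m1), (2 * l + 1) / (2 * m1)
    return (m1 * (t_next - t) * cell_average(l, m1)
            + m1 * (t - t_l) * cell_average(l + 1, m1))

def max_interior_error(m1, grid=2000):
    """Largest gap between tau_bar and sin(2 pi t) on [t_1, t_m1]."""
    first, last = 1 / (2 * m1), 1 - 1 / (2 * m1)
    ts = [first + (last - first) * i / grid for i in range(grid + 1)]
    return max(abs(tau_bar(t, m1) - math.sin(2 * math.pi * t)) for t in ts)

for m1 in (16, 64):
    print(m1, max_interior_error(m1))
```

For a twice-differentiable function the interior error shrinks like $m_1^{-2}$; under the weaker smoothness assumed in the text the corresponding rate is $m_1^{-\alpha_1}$, and the boundary cells, which this sketch excludes, are where the slower $m_1^{-1}$ rate appears.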

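9. Divergence bounds for gamma distributions.

Lemma 6. The K-L divergence between $P_1 = \Gamma(\alpha_1, \beta_1)$ and $P_2 = \Gamma(\alpha_2, \beta_2)$
is

(9.1) $D(P_1, P_2) \le \frac{(\alpha_1 - \alpha_2)^2}{2\alpha_1^2} + O\left(\frac{(\alpha_1 - \alpha_2)^2}{\alpha_1^3}\right)$

when the means are the same ($\alpha_1\beta_1 = \alpha_2\beta_2$).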
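The divergence between a pair of gamma distributions is

(9.2) $D(P_1, P_2) = \frac{\alpha_1(\beta_1 - \beta_2)}{\beta_2} + \log\left(\frac{\Gamma(\alpha_2)}{\Gamma(\alpha_1)}\right) + \alpha_2\log\left(\frac{\beta_2}{\beta_1}\right) + (\alpha_1 - \alpha_2)P_1\log\left(\frac{X}{\beta_1}\right)$.

We can rewrite (9.2) just in terms of the $\alpha$'s using the substitution $\frac{\beta_1}{\beta_2} = \frac{\alpha_2}{\alpha_1}$,

(9.3) $D(P_1, P_2) = (\alpha_2 - \alpha_1) + \alpha_2\log\left(\frac{\alpha_1}{\alpha_2}\right) + \log\left(\frac{\Gamma(\alpha_2)}{\Gamma(\alpha_1)}\right) - (\alpha_2 - \alpha_1)P_1\log\left(\frac{X}{\beta_1}\right)$.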

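Let $\delta = \alpha_2 - \alpha_1$. As in the proof of Lemma 7, the last two terms can be
bounded using the integral remainder of a Taylor series,

$$\log\left(\frac{\Gamma(\alpha_1 + \delta)}{\Gamma(\alpha_1)}\right) - \delta P_1\log\left(\frac{X}{\beta_1}\right) = \int_0^\delta \psi'(\alpha_1 + t)(\delta - t)\,dt.$$

On the other hand, the first two terms in (9.3) have a similar Taylor series
form,

$$(\alpha_2 - \alpha_1) + \alpha_2\log\frac{\alpha_1}{\alpha_2} = \delta - (\alpha_1 + \delta)\log[1 + \delta/\alpha_1] = -\int_0^\delta \frac{\delta - t}{\alpha_1 + t}\,dt.$$

The classical expansion in (12.2) implies that the K-L divergence is

$$D(P_1, P_2) = \int_0^\delta \psi'(\alpha_1 + t)(\delta - t)\,dt - \int_0^\delta \frac{\delta - t}{\alpha_1 + t}\,dt = \int_0^\delta \frac{\delta - t}{2(\alpha_1 + t)^2} + (\delta - t)O(\alpha_1^{-3})\,dt,$$

which gives the bound asserted by the lemma.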

9.1. What if we increase the degrees of freedom? In order to take into
account the noncentrality of some $\chi^2$ distributions, we need a bound on the
divergence between a $\Gamma(\alpha_1, \beta_1)$ and a $P_2 = \Gamma(\alpha_2 + \Delta, \beta_2)$ when $\alpha_1\beta_1 = \alpha_2\beta_2$
and $\Delta > 0$,



D( 



1^2) 



(9.4) 



>(Pi,P 2 ) + Alog( —) 



+ log 



r(q 2 + A) 
T(a 2 ) 



APi lo 



x 
Ji 



From Lemma 7 



(9.5) 



log 



r(q 2 + A) 

r(a 2 ) 



AP 2 log 



x 

A 
and the difference in taking the expectation with respect to $P_1$ instead of
$P_2$ can be bounded using Jensen's inequality,



Alog(^)+A 

\C*2 



log 



X 



log 



X 



m A 
< Alog — < 



1 — 1 ' 
a\ — 1 ol\ — 1 

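Therefore, substituting this last inequality, (9.5), and Lemma 6 into (9.4),

(9.6) $D(P_1, P_2) \le \frac{(\alpha_1 - \alpha_2)^2}{2\alpha_1^2} + \frac{\Delta^2}{2\alpha_2} + \frac{\Delta}{\alpha_1 - 1} + O\left(\Delta^2\alpha_2^{-2} + (\alpha_1 - \alpha_2)^2\alpha_1^{-3}\right)$.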



10. Divergence between normal distributions. If there are two normal
distributions with means $\mu_1$ and $\mu_2$, and variances $\sigma_1^2$ and $\sigma_2^2$, respectively,
then the divergence between them is

(10.1) $D(N_1, N_2) = \frac{1}{2}\left[\frac{\sigma_1^2}{\sigma_2^2} - 1 - \log\frac{\sigma_1^2}{\sigma_2^2}\right] + \frac{(\mu_1 - \mu_2)^2}{2\sigma_2^2}$.
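Since the normal divergence is used repeatedly in Section 7, a direct implementation is a convenient reference. This is a minimal sketch of the standard K-L divergence of $N(\mu_1, \sigma_1^2)$ from $N(\mu_2, \sigma_2^2)$; the function name and argument order are assumptions of the sketch. It also shows that when only the variances differ, with $r = \log(\sigma_1^2/\sigma_2^2)$ small, the divergence is close to $r^2/4$, which is why squared differences of log-variances appear in the bounds.

```python
import math

def kl_normal(mu1, var1, mu2, var2):
    """K-L divergence of N(mu1, var1) from N(mu2, var2)."""
    ratio = var1 / var2
    return 0.5 * (ratio - 1.0 - math.log(ratio)) + (mu1 - mu2) ** 2 / (2.0 * var2)

# Equal variances: the divergence reduces to the squared mean difference over 2 sigma^2.
print(kl_normal(1.0, 1.0, 0.0, 1.0))

# Equal means, log-variance difference r: compare with r^2 / 4.
r = 0.01
print(kl_normal(0.0, math.exp(r), 0.0, 1.0), r ** 2 / 4)
```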



10.1. The logarithm of the gamma distribution. Suppose that X has a
$\Gamma(\alpha, 1)$ distribution and $W = \log(X)$. Let f(w) be the density of W which
is approximately normal for large $\alpha$. Let $\phi_\alpha(z)$ be the density of a
$Z \sim N(\log\alpha, \alpha^{-1})$ distribution. The K-L divergence between these distributions
can be bounded using Stirling's formula,

$$D(Z, W) = \log\Gamma(\alpha) - \log(\sqrt{2\pi}) - \frac{1}{2} + \frac{1}{2}\log\alpha + \alpha e^{1/(2\alpha)} - \alpha\log\alpha \le \frac{1}{3\alpha}$$

for $\alpha > 1/2$.

To extend this bound, notice the logarithm of a $\Gamma(\alpha, \beta)$ random variable is
a shift of the distribution of $\log X$. Therefore, if $X \sim \Gamma(n/2, 2\sigma^2/n)$, then
$D(Z, \log(X)) \le n^{-1}$ where $Z \sim N(\log\sigma^2, \frac{2}{n})$. Compare this to [14], Lemma
A.3.
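The bound for the logarithm of a gamma variable can be evaluated in closed form. The sketch below assumes the divergence is the expectation under Z of $\log\phi_\alpha(Z) - \log f(Z)$, where $f(w) = e^{\alpha w - e^w}/\Gamma(\alpha)$ is the density of $\log X$; working out that expectation gives the expression coded here. The function name is an assumption.

```python
import math

def kl_normal_vs_log_gamma(alpha):
    """D(Z, W) for Z ~ N(log alpha, 1/alpha) and W = log X, X ~ Gamma(alpha, 1).

    E[log phi(Z)] = -0.5*log(2 pi) + 0.5*log(alpha) - 0.5, and
    -E[log f(Z)] = lgamma(alpha) - alpha*log(alpha) + alpha*exp(1/(2 alpha)).
    """
    return (math.lgamma(alpha) - 0.5 * math.log(2 * math.pi) - 0.5
            + 0.5 * math.log(alpha) - alpha * math.log(alpha)
            + alpha * math.exp(1 / (2 * alpha)))

for alpha in (0.5, 1.0, 2.0, 10.0, 100.0):
    print(alpha, kl_normal_vs_log_gamma(alpha), 1 / (3 * alpha))
```

For large $\alpha$ the computed divergence behaves like $5/(24\alpha)$, comfortably inside the $1/(3\alpha)$ bound.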



11. Proof of Lemma 5. To bound the divergence between the sums, we
define some similar random variables $X_i^* \sim \Gamma(\delta_i(1 + r_i)n, \bar\sigma^2/n)$. The definition
of $r_i$ implies that $\sum_i \delta_i r_i = 0$, and thus the distribution of the sum of the
$X_i^*$'s is the same as the distribution of Y. It is necessary that $1 + r_i > 0$, but
the bound is only interesting for small $r_i$ anyway.

We bound the divergence between the sums by the divergence between the
joint distributions,
$D(\sum_i X_i, \sum_i X_i^*) \le D((X_1, \ldots, X_m), (X_1^*, \ldots, X_m^*)) = \sum_i D(X_i, X_i^*)$.
The divergence can then be bounded using

i;D(x j ,x")=f;io g [ r(<5i(1+r - ) " ) 

i=l i=l L 



(11.1) 



5irinElogXi + E 



:( To "9 

<H erf 



From Lemma 7, the first two terms of (11.1) are 
T(<5jra + Sinn)' 



log 



(11.2) 



r(M 

-a>,nlog 



0>iraElogA^ 



n 



+ 



+ 



Hi i 



+ o 



Sir? 



n 



+ kil 6 + SArA 



4 6 12 

In summing over i, the first term is nJ^iLi Siri(loga 2 — logn) = nJ^iLi Sir 2 
because ^5iri\oga 2 = Y^SiTilogn = 0. Summing over all the terms in (11.2) 
yields a bound on the contribution from the first two terms of (11.1), 

nSirf 



(H-3) E 



2 + 4 



77 no~,r? nJ,rf 



G 



+ 



12 



+ 0(n5 i |ri| 5 + |ri| 3 + <5 i r?rr 1 ) 



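The last term in (11.1) is

(11.4) $E\left[nX_i\left(\frac{1}{\bar\sigma^2} - \frac{1}{\sigma_i^2}\right)\right] = n\delta_i(e^{r_i} - 1) = n\delta_i r_i + \frac{n\delta_i r_i^2}{2} + \frac{n\delta_i r_i^3}{6} + \frac{n\delta_i r_i^4}{24} + O(n\delta_i|r_i|^5)$.

By summing (11.4) over i and then adding it to (11.3), we bound the
divergence by

$$\sum_{i=1}^m \left[\frac{r_i^2}{4} + \frac{n\delta_i r_i^4}{8} + O\left(n\delta_i|r_i|^5 + |r_i|^3 + \delta_i^{-1} r_i^2 n^{-1}\right)\right].$$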



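12. Digamma bound.

Lemma 7. For $X \sim \Gamma(\alpha, 1)$,

$$\log\left(\frac{\Gamma(\alpha + \delta)}{\Gamma(\alpha)}\right) - \delta E\log X = \frac{\delta^2}{2}\left(\frac{1}{\alpha} + \frac{1}{2\alpha^2} + O(\alpha^{-3})\right) - \frac{\delta^3}{6}\left(\frac{1}{\alpha^2} + O(\alpha^{-3})\right) + \frac{\delta^4}{24}\left(\frac{2}{\alpha^3} + O(\alpha^{-4})\right) + O(\delta^5\alpha^{-4})$$

as $\alpha \to \infty$ and $\delta/\alpha \to 0$.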
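This lemma is essentially a result on the properties of gamma and "polygamma"
functions,

$$\frac{d}{d\alpha}\Gamma(\alpha) = \frac{d}{d\alpha}\int_0^\infty x^{\alpha-1}e^{-x}\,dx = \int_0^\infty (\log x)x^{\alpha-1}e^{-x}\,dx = \Gamma(\alpha)E\log X.$$

Thus,

$$\log\left(\frac{\Gamma(\alpha + \delta)}{\Gamma(\alpha)}\right) - \delta E\log X = \log\left(\frac{\Gamma(\alpha + \delta)}{\Gamma(\alpha)}\right) - \delta\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}.$$

The (n + 1)th derivative of the logarithm of the gamma function is the
polygamma function $\psi^{(n)}(\alpha)$. Thus the expression can be seen as a Taylor
expansion of $\log\Gamma(k)$ around $k = \alpha$,

(12.1) $\log\left(\frac{\Gamma(\alpha + \delta)}{\Gamma(\alpha)}\right) = \delta\psi^{(0)}(\alpha) + \frac{\delta^2}{2}\psi^{(1)}(\alpha) + \frac{\delta^3}{6}\psi^{(2)}(\alpha) + \frac{\delta^4}{24}\psi^{(3)}(\alpha) + O(\delta^5\psi^{(4)}(\alpha))$.

There is a classical result,

(12.2) $\psi^{(1)}(\alpha) = \frac{1}{\alpha} + \frac{1}{2\alpha^2} + O\left(\frac{1}{\alpha^3}\right)$,

and plugging the derivatives of this function into (12.1) gives the desired
expansion.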
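The expansion in Lemma 7 can be checked numerically for integer shape parameters, where the digamma function has the exact form $\psi(a) = -\gamma + \sum_{k=1}^{a-1} 1/k$. The sketch below compares the exact left-hand side with the first three terms of the expansion, dropping the O(·) remainders; the function names and the example values are assumptions of this sketch.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def digamma_int(a):
    """psi(a) for a positive integer a, via the harmonic-number identity."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, a))

def lemma7_sides(a, d):
    """Exact log Gamma(a+d)/Gamma(a) - d*psi(a) versus the truncated expansion."""
    lhs = math.lgamma(a + d) - math.lgamma(a) - d * digamma_int(a)
    rhs = ((d ** 2 / 2) * (1 / a + 1 / (2 * a ** 2))
           - (d ** 3 / 6) * (1 / a ** 2)
           + (d ** 4 / 24) * (2 / a ** 3))
    return lhs, rhs

for a, d in ((50, 2), (200, 5)):
    print(a, d, lemma7_sides(a, d))
```

The gap between the two sides is of the order of the neglected remainders, so it shrinks as $\alpha$ grows with $\delta/\alpha$ small.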



13. Bound. In Section 7.2.2, we need a bound on the quantity 

■2 



9a* (X) n -m 

Si = Elog 1 = — ^ log 

g at [X) 2m r 



o 



jeJt 



a 



*2 



-2 „*2 



(17 



a;: 



To straighten out this sum we need to separate it into two sets of j's, 



J 



£-1 j 21 - 1 

3 ■ < — < 

rri\ mo 2m\ 



and Jn 



21 -\ j £ 

3-- < — < — 

zmi mo mi 



so that J} U ,7/ = Ji. For j G j} the variances cr|* are a e 3 a t \ 1,J , and for 

Of Of 

j G i7/ the variances cr|* are a k ' j 07. ^ 1 "' . Thus the sum can be written as 



57 



ra — mo 
2m,Q 



+ 



re — mo 
2mo 



07 



1/4 3/2 1/4 



+ 



a* 



r 2C»,- 2(l-<^,) 



-1 



1/4 3/2 1/4 
°£-l°£ °M-1 



1/4 3/2 1/4 
a i-\ a l a e+i 



+ 



a 



m 
a] 



2C e ,j 2(l-Ci,j) 



07 0" 



€+1 



1/4 3/2 1/4 
Note that at the edges, where $\ell = 1$ or $\ell = m_1$, this equation is still true
if we define $\sigma_0^2 = \sigma_1^2$ and $\sigma_{m_1+1}^2 = \sigma_{m_1}^2$.

Setting r£ = log erf, ^ — log erf and rt-\ = log erf — log<rf_ 1 , we can write 
SV as 



5/ 



n-m ^ (r^_i - r^) 



2mo 



E 



+ exp[-r £ „i(l - Qj)] 



1 - exp -(r^_i -r£ 



+ 



n-m (r^_ x - r £ ) 



2mo 



E 



+ exp[r^(l - Oj)] 



1 — exp 



Qta-i-r<)) 



< 



n — mo 
2mo 



E(^p)[i-^- l(1 " c - } ] 

[1 _ e r f(i-0,j) 



E 



< 



n-m /r^_i 



2mg 



E r ^-i( 1 - - E - 



By the definition of the weights $\zeta_{\ell,j}$, each one is between 0 and 1 and the
sums of them are Ejej; (1 - Q,j) = Ej G ^ 2 (l - Ce,j) = |(^)- Therefore, 



(13.1) 



3(n — m ) 



128mi 



(r^_i - r^) . 



The definition of the function class says that the function $\tau(t) = \log(\sigma^2(t))$
is smooth in the sense that $\tau(t + \delta) = \tau(t) + \delta\tau'(t) + E$ where $|E| \le M\delta^{\alpha_1}$. To
use this notice that, for $\ell = 1, \ldots, m_1 - 1$, expanding the functions around
$(2\ell - 1)/2m_1$,



-l = m 1 



mi 



l/2mi {21+1 



-l/2mi V 2mi 

l/2mi 



2£- 1 
2mi 



2^-3 
2mi 



l/2mi 



Ei - 2E 2 + £ 3 dt, 



and the average value of the errors is less than mi SLu^mi l-Ei — 2E2 + 
-E3I < 5Mm~ Ql . Plugging this bound into (13.1), we have 

-M 2 (n-m )l 2ai 



(13.2) St < 



mi 



77; , 



for 1 = 2,.. ., mi — 1. 
By definition, $r_0 = r_{m_1} = 0$ and the only bound available for the first and
last term is $|r_\ell| \le M m_1^{-1}$, thus

r-t q n\ fQ a w \M 2 (n-m y 

(13.3) max(5i,5 mi ) < 



mi 